Abstract


In a single instruction-stream/multiple data-stream (SIMD) computer, calculations are performed
by simple processing elements (PEs) that are not independently capable of program-control operations.
In lock-step, the PEs execute one program that is sequenced by a single system controller.
Large numbers of these simple PEs are obtained through replication of a PE chip containing many
identical  PEs. 

A state-of-the-art SIMD computer is regulated by a single system clock that is distributed throughout
the computer. On each system clock cycle, the system controller broadcasts the next instruction
to  be  executed  by  the  PEs.  The  system  clock  interval  allows  time  to  distribute  a  PE  instruction 
throughout  the  computer,  an  action  that  typically  requires  more  time  than  the  minimum  interval  of 
a  clock  regulating  the  PEs  themselves  within  the  PE  chips.  The  disparity  between  the  highest  rate  of 
PE  operation  and  the  rate  of  global  instruction  broadcast  gives  rise  to  a  heretofore  un-compensated 
clock-rate  limitation. 

To  overcome  this  limitation,  instruction-cached  SIMD  computer  architecture  provides  for  a  small 
instruction  buffer  to  be  placed  within  the  replicated  PE  chip.  This  buffer  stores  repeated  instruction 
sequences  for  subsequent  retrieval  at  the  relatively  high  rate  attainable  within  the  PE  chip.  The 
instruction  buffer  and  its  control  mechanism  comprise  a  SIMD  instruction  cache,  or  I-cache. 

Throughput  measures  computer  performance,  while  total  chip  area  is  a  primitive  measure  of  a 
computer’s monetary cost that is most appropriate for high-PE-count multiprocessors. The ratio
of  throughput  to  area  is  a  figure  of  merit  expressing  a  multiprocessor’s  ratio  of  performance  to 
hardware  cost.  My  thesis  is  that  even  the  simplest  I-cache  variants  increase  throughput-to-area 
ratios  significantly. 

I-cache  increases  the  rate  at  which  instructions  are  delivered  to  the  PEs.  However,  I-cache  takes 
up  space  in  the  PE  chip  that  would  otherwise  have  been  used  for  PEs  themselves.  If  the  total  chip 
area of the computer is fixed, then I-cache reduces the chip area used for PEs. For scalable data-parallel
problems, reducing the number of PEs reduces the throughput. Therefore, the magnitude of
the throughput-to-area ratio impact of I-cache depends on the balance between the competing factors
of instruction execution-rate increase and displacement of PEs from the PE chip.
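
The balance described above can be illustrated with a small sketch. It is not a model taken from the dissertation: it assumes that throughput scales linearly with PE count at fixed total chip area and that I-cache displaces PEs in direct proportion to the area it occupies. The function names and the sample inputs are hypothetical, chosen only to echo figures quoted elsewhere in this text.

```python
def throughput_to_area_gain(speedup: float, cache_area_fraction: float) -> float:
    """Relative change in throughput-to-area ratio at fixed total chip area.

    Assumption (not from the source): throughput is proportional to the
    number of PEs, and I-cache removes PEs in proportion to its area.
    """
    if not 0.0 <= cache_area_fraction < 1.0:
        raise ValueError("cache_area_fraction must be in [0, 1)")
    remaining_pe_fraction = 1.0 - cache_area_fraction
    return speedup * remaining_pe_fraction


def break_even_speedup(cache_area_fraction: float) -> float:
    """Smallest instruction-rate speedup that offsets the displaced PEs."""
    if not 0.0 <= cache_area_fraction < 1.0:
        raise ValueError("cache_area_fraction must be in [0, 1)")
    return 1.0 / (1.0 - cache_area_fraction)


if __name__ == "__main__":
    # Illustrative inputs: 1% of chip area for I-cache, speedup of 1.3.
    print(throughput_to_area_gain(1.3, 0.01))
    print(break_even_speedup(0.01))
```

Under these assumptions, a cache that occupies a small fraction of the chip needs only a correspondingly small speedup to break even, which is why the area estimate and the measured speedups are considered together.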

In  analyzing  SIMD  instruction  cache,  the  dissertation  considers  the  interacting  characteristics 
of data-parallel program properties and SIMD computer electrical characteristics in I-cache design.
Designs of two simple I-cache variants are presented and their chip-area costs are estimated. For
a diversity of data-parallel problems, these I-caches are evaluated over a range of underlying PE
architecture  and  system  characteristics.  The  evaluations  are  performed  using  register-transfer-level 
simulations  of  SIMD  computations  on  a  clock-phase  by  clock-phase  basis. 

The simple I-cache variants occupy negligible chip area and yet yield speedups ranging from 1.3
to 7.6 for the simple problems under realistic assumptions about the electrical characteristics of the
SIMD computer. The computations for which these results are obtained include such problems as
sorting and tree-reduction, which are commonly thought to be inherently inter-PE-communication-bound
because of their characteristic frequency and complexity of inter-PE communication. The
simulation  results  suggest  that  there  is  a  wide  range  of  problem  characteristics  over  which  I-cache 
makes for the best use of chip area in a SIMD computer.

Chapter 1

Introduction 


Some  computational  problems  demand  the  fastest  possible  computers.  The  two  main  technologies 
used for attaining maximum throughput are VLSI and parallelism. VLSI is a fabrication technology
that exploits the inherent speed and cost advantages arising from packing computer system
components together inside chips. Parallelism is an organization technology that exploits the speed
advantages of harnessing the collective efforts of large numbers of computational units. These computational
units are processing elements (or PEs), each of which is physically small enough to fit
inside a modern VLSI chip.

The inherent performance and cost advantages of placing systems within chips impel both reductions
in transistor dimensions and increases in the physical sizes of chips. The size of a chip, as
measured in the number of transistors it contains, increases over time.

Parallel  computers  containing  large  numbers  of  coordinated  PEs  are  the  fastest  computers  for 
some  problems.  The  data-parallel  problems  [41]  constitute  one  class  of  computational  problem  for 
which the solution speed is roughly proportional to the number of PEs. For a scalable data-parallel
problem,  the  fastest  parallel  computer  is  that  containing  the  greatest  number  of  PEs,  all  else  being 
equal. 

There  is  an  interaction  between  VLSI  and  parallelism  that  warrants  caution  on  the  part  of 
the  computer  architect:  VLSI  implementation  technique  imposes  an  upper  limit  on  the  number 
of  PEs  that  may  be  packaged  together  in  a  chip.  The  compromise  with  respect  to  this  technical 
reality is to decompose the target system into modules of manageable size, each containing as many
PEs as will fit on one chip when accompanied by whatever circuitry may be needed to facilitate
the decomposition. Coordinating large numbers of PEs requires inter-chip communication, which
typically occurs at a lower rate than communication within chips. The speed advantages of parallel
computers are tempered by the extent to which the activity of a PE can no longer be confined
to a single chip. Coordination among PEs, access by PEs to non-integral storage or functional
components, and movement of input and output data sets to and from the PEs, all incur significant
speed penalties. These speed penalties tend to lessen as VLSI implementation technique improves,
because an increasing fraction of the total activity occurs at the relatively high rates achievable
within  the  confines  of  a  chip. 

This  dissertation  addresses  managing  one  interaction  between  VLSI  and  parallelism  in  the  design 
of  Single  Instruction-stream/Multiple  Data-stream  (or  SIMD)  computers.  A  SIMD  computer  is  a 
parallel  computer  whose  PEs  are  as  simple  as  possible,  so  as  to  occupy  as  little  chip  area  as  possible. 
Because SIMD computer architecture packs a maximum number of PEs into the available total chip
area, one might expect a SIMD computer to be among the fastest possible for data-parallel problems.
Unfortunately, generic SIMD computer architecture introduces a throughput limitation by requiring
that a new instruction be broadcast to the PEs on each clock cycle. Instruction broadcast to large
numbers of chips occurs at a lower rate than the PEs’ maximum rate of intra-chip operation.

Of the many cost factors juggled by computer designers, including monetary cost, power consumption,
size, and cooling, total chip area is not often the most important. However, there are application
contexts  wherein  making  the  best  use  of  chip  area  is  very  important.  Making  good  use  of  chip  area  is 
paramount  where  the  greatest  throughput  is  sought  for  a  given  implementation  budget  with  respect 
to  total  chip  area.  Examples  of  this  type  of  application  include  physical  simulations,  scientific  models, 
and  commercial  data  analysis.  Alternatively,  it  is  important  to  make  the  best  use  of  chip  area  where 
the  physical  size  of  the  computer  is  limited.  Examples  of  computer  systems  for  which  size  is  critical 
include  portable  computers,  commodity  signal  processors,  and  embedded  systems. 

Because  of  its  chip-area  parsimony,  generic  SIMD  computer  architecture  would  be  appealing,  if 
not  for  its  inherent  throughput  limitation.  The  importance  of  making  good  use  of  chip  area  leads  one 
to wonder: Is it possible to reformulate SIMD computer architecture slightly, without relinquishing
all  of  the  chip-area  advantage,  to  allow  the  PEs  to  operate  at  their  highest  rate? 

The  stakes  are  high  enough  for  the  answer  to  this  question  to  be  interesting,  because  while  a  PE 
in  a  SIMD  computer  occupies  between  20%  and  50%  of  the  chip  area  of  a  microprocessor  with  the 
same calculation components, PEs in existing SIMD computers operate at clock rates 8 to 12 times
below the maximum achievable using their VLSI implementation techniques. While a factor of up
to 5 decrease in total chip area may be compelling, surrendering a factor of 8 in operation rate to
achieve  it  is  a  poor  exchange  where  speed  is  the  ultimate  objective. 
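
The arithmetic behind this exchange can be written out as a short sketch. It assumes, as a simplification not stated in the text, that throughput per unit of chip area is proportional to PE density multiplied by PE clock rate; the function name is hypothetical. The inputs are the ranges quoted above: a SIMD PE occupying 20% to 50% of the area of a comparable microprocessor, and a clock rate 8 to 12 times below the attainable maximum.

```python
def relative_throughput_per_area(pe_area_fraction: float, clock_slowdown: float) -> float:
    """Throughput per unit area of a generic SIMD design relative to a
    microprocessor-based design running at the maximum clock rate.

    Assumption (not from the source): throughput per area is proportional
    to (PEs per unit area) x (PE clock rate).
    """
    if not 0.0 < pe_area_fraction <= 1.0:
        raise ValueError("pe_area_fraction must be in (0, 1]")
    if clock_slowdown < 1.0:
        raise ValueError("clock_slowdown must be >= 1")
    density_gain = 1.0 / pe_area_fraction  # e.g. 20% area gives 5x as many PEs
    return density_gain / clock_slowdown


if __name__ == "__main__":
    # Most favorable quoted case: 5x density gain, 8x clock-rate loss.
    print(relative_throughput_per_area(0.20, 8.0))
    # Least favorable quoted case: 2x density gain, 12x clock-rate loss.
    print(relative_throughput_per_area(0.50, 12.0))
```

Under this simplification both cases come out below 1, which is the sense in which the exchange is a poor one where speed is the objective.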

SIMD instruction cache (or I-cache), introduced in this dissertation, is an explicitly managed instruction
buffer added to the PE chips of a SIMD computer. Globally broadcast instruction sequences
that are to be repeated are stored in I-cache for subsequent re-broadcast at the high rate attainable
within the PE chip. Whereas SIMD computer architecture simplifies PEs by eliminating program
control that is redundant for some problems, adding I-cache re-introduces some redundant program
control into the PE chip. How much chip area does the re-introduced program control occupy? Further,
by how much does I-cache increase throughput? Clearly, the answers to these questions depend
on the characteristics of computational problems, of PE architecture, of VLSI implementation technique,
and of the target computer's decomposition into chips. The analysis of I-cache reveals how
these characteristics interact. Detailed evaluations of the simplest I-cache variants over a set of
diverse sample problems show that throughput increases are two to three orders of magnitude larger
than the concomitant chip-area cost of I-cache in a modern PE chip.


1.1  The  Thesis 

The ratio of throughput to total chip area is a useful computation metric in a world of finite resources:
For a given total chip area used in a computer, a higher throughput-to-area ratio implies
higher  throughput.  Equivalently,  for  a  given  required  throughput,  a  higher  throughput-to-area  ratio 
implies  lower  total  chip  area.  For  scalable  data-parallel  problems,  high  throughput-to-area  ratios  are 
exploitable  either  as  high  throughput  or  as  low  chip  area,  depending  upon  application  requirements. 
Throughput-to-area  ratio  is  an  especially  important  metric  in  designing  computers  for  problems 
demanding  the  highest  possible  throughput  given  a  limited  implementation  budget  with  respect  to 
total  chip  area. 

This section introduces my thesis by providing the background for understanding the throughput-to-area
ratio consequences of I-cache. Figure 1.1 illustrates a sequence of improvements to computer
architecture leading from generic uniprocessors to maximum-throughput programmable VLSI-based
multiprocessors.  Each  step  in  the  sequence  increases  the  throughput  or  decreases  the  cost  of  the 
computer.  Each  step  is  applicable  only  to  a  subset  of  the  computations  at  that  step,  so  the  domain 
of  problems  narrows  as  successive  steps  are  taken.  This  particular  sequence  describes  one  path 
towards  the  goal  of  fast,  inexpensive  computations,  rather  than  all  such  paths. 

The last two steps in the sequence shown in Figure 1.1 are new, resulting in I-cached SIMD
computers. My thesis is that significant speedups are exhibited even for simple I-cache variants over
a  broad  range  of  data-parallel  computations. 

The first step in Figure 1.1 characterizes the advent of the microprocessor in the 1970s [76],
whereupon it became possible to place substantial parts of all of the main subsystems of a computer
on  a  single  chip,  thus  enabling  appreciably  fast  computation  at  a  relatively  low  cost. 

The  second  step  in  Figure  1.1  reflects  the  possibility  of  multiprocessor  computers  being  faster, 
though of course more expensive, than their single-processor counterparts. This step can be taken
for computations that admit parallel solutions. It is interesting historically to note that the second
step shown in Figure 1.1 was in fact applied first, in the construction of non-VLSI multiprocessors
including Solomon [74] and ILLIAC IV [3] in the 1960s. This variance with historical progress
underscores the fact that the sequence represented in Figure 1.1 is not prescriptive of the development
of  inexpensive  fast  computers,  but  rather  merely  suggestive  of  one  path  leading  in  that  direction. 

The most general form of parallel computer is a Multiple Instruction-stream/Multiple Data-stream
(or MIMD) computer, wherein each PE comprises counterparts of the uniprocessor’s calculation
and program-control components, adapted for inter-PE communication. It is increasingly common
for MIMD computers to contain microprocessors and memory chips in their replicated building
blocks. Because they are used in a very large number of applications, microprocessors and memory
chips are manufactured in high volumes. High-volume manufacture typically leads to low monetary
cost. Low monetary PE cost underlies the popularity of this style of MIMD computer design.

The  common  parallel-programming  practice  of  replicating  a  single  program  in  all  PEs  of  a  MIMD 
computer  is  called  Single  Program/Multiple  Data-stream  (or  SPMD)  programming.  Although  SPMD 
programming  is  a  method-of-use  specialization  of  a  MIMD  computer  rather  than  an  improvement  to 
the computer itself, the simplicity of SPMD in some cases reduces the programming costs associated
with multiprocessor computation. This step can be taken only for computations solving data-parallel
problems. 

The next step in the sequence shown in Figure 1.1 can be followed for SPMD computations wherein
all PEs’ executions of the single program happen to proceed in a common sequence. In these cases,
the  replication  of  program  storage  and  program  sequencing  in  every  PE  is  redundant.  Eliminating 
the  redundant  program  control  from  the  PE,  so  that  instead  a  system  controller  provides  a  single 
sequence  of  instructions  for  all  PEs  via  a  global  instruction  broadcast  network,  significantly  reduces 
the chip area occupied by a PE. This control sharing characterizes SIMD computer architecture.

In  general,  a  SIMD  computer  is  less  convenient  to  program  than  a  MIMD  computer,  because  of 
the requirement that all PEs receive a common sequence of instructions. This inconvenience restricts
the  class  of  problems  for  which  SIMD  computers  are  appropriate.  Primarily,  these  problems  are  the 
scalable  data-parallel  problems.  The  low  production  volume  of  SIMD  PE  chips  makes  the  monetary 
fabrication  cost  per  chip  characteristically  high.  A  consequence  of  high  PE  chip  cost  is  that  the 
applications  for  which  SIMD  computers  are  practical  are  further  restricted  to  those  in  which  making 
the  best  use  of  chip  area  is  paramount. 

The shading of the “generic SIMD computer” box in Figure 1.1 indicates that it is the starting
point  for  the  modifications  suggested  in  this  dissertation. 

Inter-chip  wire  delays  tend  to  be  large  compared  to  intra-chip  wire  delays.  The  broadcast  of  a  new 
instruction every clock cycle from a central repository, as occurs in a generic SIMD computer, means
that every instruction’s execution involves inter-chip signaling. Therefore, instruction execution
proceeds at a relatively low inter-chip signaling rate. Generic SIMD computers suffer from an
inherent  instruction  delivery-rate  limitation. 

The  maximum  operation  rates  of  a  SIMD  computer’s  subsystems  are  determined  principally  by 
VLSI  implementation  technique  and  by  the  electrical  propagation  characteristics  of  inter-chip  wires. 
The  next  step  in  Figure  1.1  is  to  adapt  a  generic  SIMD  computer  so  that  clocks  are  available  to 
regulate the various subsystems, including the PEs, each at its maximum rate. Such a computer is a
multi-clock SIMD computer. A multi-clock SIMD computer might incorporate a clock-rate multiplier
in the PE chip resembling the high-rate clock generators that are increasingly commonly used in
microprocessors [5, 88]. A multi-clock generator provides a PE clock within the PE chip.

Note that in a multi-clock SIMD computer, even when the rate of the PE clock exceeds that of
the clock regulating global instruction broadcast, PEs still receive instructions at the relatively low
instruction broadcast rate. I-cache is one means to overcome the instruction delivery-rate limitation.
I-cache is an explicitly managed instruction buffer inside the PE chip. In an I-cached SIMD computer,
repeated  instruction  sequences  are  stored  within  the  PE  chip  for  subsequent  retrieval  at  the  relatively 
high  PE  clock  rate. 
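
A minimal sketch can indicate how the repetition count of a loop bounds the benefit of retrieving instructions from I-cache. It is an idealization, not the dissertation's simulation method: it assumes that the first pass over a loop body is broadcast, executed, and stored at the global broadcast rate, that every later pass is retrieved at the PE clock rate, and that cache-control overhead and the limits of other subsystems are ignored. The function name and parameters are hypothetical.

```python
def icache_loop_speedup(iterations: int, rate_ratio: float) -> float:
    """Idealized speedup of an I-cached over a generic SIMD computer for one loop.

    iterations: number of times the loop body executes.
    rate_ratio: PE clock rate divided by global instruction broadcast rate.

    Assumptions (not from the source): the first iteration runs at the
    broadcast rate while the body is stored; later iterations run at the
    PE clock rate; no cache-management overhead is charged.
    """
    if iterations < 1:
        raise ValueError("iterations must be >= 1")
    if rate_ratio < 1.0:
        raise ValueError("rate_ratio must be >= 1")
    # Times are in units of one broadcast-rate pass over the loop body.
    generic_time = float(iterations)
    cached_time = 1.0 + (iterations - 1) / rate_ratio
    return generic_time / cached_time


if __name__ == "__main__":
    for n in (1, 2, 8, 64, 1024):
        print(n, icache_loop_speedup(n, 8.0))
```

In this idealization the speedup is 1 for a loop executed once and approaches the rate ratio as the iteration count grows, so short loops gain little and long loops gain nearly the full clock-rate disparity.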

While  it  is  reasonable  to  expect  an  I-cached  SIMD  computer  to  be  at  least  somewhat  faster  than  a 
generic SIMD computer, it is not clear a priori just how costly are multi-clocking and I-cache. Indeed,
some computer scientists might reasonably object to the improvement claimed for the final two steps
in Figure 1.1: How can multiple clocks and I-cache speed up SIMD computation, which we suspect
to  be  inherently  limited  in  the  rates  of  inter-PE  communication  or  local  external  memory  access? 
Such  objections  may  be  lodged,  and  indeed  need  to  be  addressed,  with  respect  to  every  subsystem  of 
a  SIMD  computer  in  relation  to  the  global  instruction  broadcast  subsystem,  which  is  the  focus  of  the 
thesis. 

This  dissertation  provides  evidence  that  adding  I-cache  is  very  attractive:  while  occupying  less 
than 1% of the area of a modern PE chip, I-cache significantly increases the throughput of a diversity
of  data-parallel  computations.  Detailed  simulations  of  simple  I-cache  variants  reveal  that  the  I-cache 
tradeoff  has  many  facets,  including  the  following: 


•  I-cache  speedups  depend  strongly  on  the  relative  clock  rates  of  the  computer’s  subsystems. 
Where the disparity between PE clock rate and global instruction broadcast rate is a factor
of  8,  these  simple  I-cache  variants  yield  speedups  ranging  between  factors  of  1.3  and  7.9  for  a 
diverse  set  of  sample  programs.  For  sample  programs  with  simple  loop  structure,  simple  I-cache 
variants  do  nearly  as  well  as  possible.  Some  of  the  sample  programs  have  more  complicated 
loop  structures,  for  example  wherein  repeated  instruction  sequences  alternate  execution.  These 
programs with complex loop structures demand more complex I-cache variants.


•  Surprisingly, the interaction between I-cache speedup and PE chip pin time-sharing for multi-chip
subsystems depends on the relative clock rates of the subsystems. When a multi-chip
subsystem’s clock rate is as low as the instruction broadcast rate, I-cache speedup decreases
with  the  degree  of  PE  chip  pin  time-sharing.  However,  if  the  subsystem’s  clock  rate  is  higher 
than  the  instruction  broadcast  rate,  I-cache  speedup  in  fact  increases  with  the  degree  of  PE 
chip  pin  time-sharing.  I-cache  acts  in  some  cases  to  reduce  communication  bottlenecks  that 
typically occur at chip boundaries.


•  Appropriate management of I-cache has important consequences for speedup. For some sample
problems, straightforward static management of I-cache yields near-maximum speedups. Other
sample  problems,  for  example  those  wherein  loop  iterations  are  data-dependent  or  loop-index- 
dependent,  demand  dynamic  cache  management  mechanisms. 


•  A  variety  of  strategies  for  providing  chip  area  for  I-cache  in  a  PE  chip  of  fixed  size  are  presented. 
Empirical  evaluations  of  these  strategies  indicate  that  while  no  one  strategy  works  best  in  all 
cases,  it  is  very  likely  that  small  I-cache  can  be  accommodated  with  slight  impact  on  the 
operational structure of a SIMD computation.


1.2 The Structure of the Dissertation

The  reader  may  currently  be  using  (or  thinking  about  using)  data-parallel  computation  to  solve  a 
particular  problem.  Such  a  reader  is  likely  interested  in  learning  how  much  faster  an  I-cached  SIMD 
computer  would  be  than  its  generic  counterpart  for  that  problem,  and  at  what  cost.  Unfortunately, 
there is a huge number of cases, and the dissertation does not provide a closed-form analytic tradeoff
equation.  The  dissertation  does  characterize  a  heretofore  unexplored  region  of  the  design  space 
for the fastest possible computers for scalable data-parallel problems. This characterization takes
the  form  of  analysis  and  evaluation,  grounded  in  detailed  examples,  that  elucidate  the  issues  and 
tradeoffs  relating  to  I-cache. 

Chapter 2 presents the essential facets of the I-cache idea. The instruction delivery-rate limitation
of  SIMD  computers  is  developed  in  a  qualitative  manner,  and  I-cache  is  proposed  as  a  means  to 
surmount  that  limitation.  Although  the  dissertation  focusses  specifically  on  I-cache  added  to  SIMD 
computers’  PE  chips,  the  central  problem  of  distributing  instructions  through  a  relatively  slow 
channel  to  many  PEs  arises  also  in  loading  programs  into  the  PEs  of  MIMD  computers.  The  use  of 
the term “I-cache” to describe an explicitly managed PE chip instruction buffer may surprise those
who  are  familiar  with  the  direct-mapped  instruction  caches  ordinarily  used  in  microprocessors.  A 
clear  distinction  is  drawn  between  SIMD  instruction  cache  and  ordinary  instruction  cache.  A  space 
of  possible  I-cache  designs  is  then  painted  in  broad  strokes,  to  provide  a  framework  in  which  to 
consider the mechanisms and functions. To illustrate the essential functions of I-cache, an example
program  for  a  SIMD  computation  is  presented  and  its  I-cache  speedup  is  calculated  and  discussed. 
Chapter  2  reveals  the  complexity  of  the  interactions  between  I-cache  and  the  various  characteristics 
of  computations,  so  Chapter  2  motivates  the  detailed  analysis  to  follow. 

Chapter  3  provides  the  definition  of  a  SIMD  computer  that  is  the  starting  point  in  the  analysis 
of  I-cache.  Chapter  3  introduces  the  SIMD  computer's  components  and  their  general  functions.  A 
small set of abstractions incorporated in the model of a generic SIMD computer allows the
model to encompass a broad range of existing and foreseeable SIMD computers. A detailed example
of  a  SIMD  computation  grounds  the  discussion.  I-cache  is  added  to  the  local  controller  of  a  PE  chip. 
To  provide  a  basis  for  assessing  the  chip-area  impact  of  I-cache,  a  physical  model  for  the  PE  chip 
is  presented.  A  practical  estimate  for  the  disparity  between  PE  clock  rate  and  global  instruction 
broadcast  rate  is  developed  which  shows  that  it  is  very  likely  for  the  disparity  to  exceed  a  factor  of 
2.  Chapter  3  concludes  with  a  consideration  of  the  options  for  overcoming  the  throughput  limitation 
that arises from that disparity in generic SIMD computers.

Chapter  4  describes  the  I-cache  design  elements  and  introduces  a  family  of  related  single-port 
I-cache  variants.  The  chip  area  occupied  by  members  of  this  family  is  estimated.  These  I-cache 
variants  occupy  as  little  as  1%  of  the  chip  area  originally  occupied  by  PEs  in  a  PE  chip  made 
with  present-day  VLSI  implementation  technique.  Interactions  of  I-cache  speedup  with  program 
properties  are  considered,  as  are  interactions  with  electrical  characteristics  of  the  computer.  These 
considerations  provide  clues  as  to  what  effects  to  look  for  in  the  empirical  evaluations  of  I-cache. 
Finally,  the  I-cache  management  problem  is  discussed  in  detail,  illustrated  with  examples  for  the 
family  of  I-cache  variants. 

Chapter  5  describes  the  empirical  method  used  to  measure  throughput  of  SIMD  computations. 
Measurements are taken from register-transfer-level simulations of a basis computer that is parameterized
to represent a broad range of generic, multi-clock, and I-cached SIMD computers. Chapter 5
also  describes  the  set  of  SIMD  computer  variants  on  which  speedups  are  measured  and  the  set 
of  sample  problems  for  which  evaluations  are  performed.  Each  simulated  computation  comprises 
a  problem  running  on  a  SIMD  computer  variant  assumed  to  contain  an  inter-PE  communication 
network topology appropriate for the problem. After giving an overview of I-cache speedups for
the various problems, Chapter 5 presents summaries of the measured speedups. The measured
speedups are compared against estimates for the greatest possible single-port I-cache speedups, and
the  chapter  concludes  with  measurements  that  reveal  the  sensitivities  of  I-cache  speedups  to  the 
intensiveness  with  which  programs  utilize  the  SIMD  computer’s  multi-chip  subsystems. 

The  I-cache  evaluations  in  Chapter  5  are  performed  under  the  assumption  that  any  additional 
chip area needed for I-cache is provided without changing the operational structure of the subject
computation. Such would be the case, for example, if the physical size of the PE chip were increased to
accommodate I-cache. Chapter 6 considers alternative strategies for providing the chip area occupied
by  I-cache,  under  the  realistic  assumption  that  the  PE  chip’s  physical  size  is  limited.  Chapter  6 
presents  I-cache  speedup  measurements  wherein  I-cache  chip  area  is  provided  by  limiting  cache  size, 
reducing  PE  register  count,  reducing  PE  chip  PE  count,  or  reducing  PE  function  unit  complexity. 
The  results  show  that  while  none  of  these  strategies  for  providing  chip  area  for  I-cache  is  universally 
desirable,  it  is  important  to  make  the  I-cache  large  enough  to  contain  entire  loop  bodies. 

Appendices  are  included  that  describe  in  detail  the  basis  computer,  the  assembly  language  used 
to  describe  the  computations  for  the  sample  problems,  an  example  of  developing  a  sample  problem’s 
solution  and  then  adding  I-cache  to  the  computation,  the  set  of  sample  programs,  and  the  complete 
set  of  measured  I-cache  speedups. 

The dissertation shows that for diverse problems, even simple I-cache variants yield significant
throughput-to-area ratio increases over generic SIMD computers. The factors by which I-cache
increases  throughput-to-area  ratio  depend  on  the  properties  of  programs,  of  VLSI  implementation 
technique,  of  PE  architecture,  and  of  the  architecture  of  the  replicated  chip  containing  the  PEs. 
Significant  speedups  are  measured  for  each  of  a  diverse  collection  of  problems,  under  reasonable 
assumptions regarding the electrical characteristics of the computer. That significant speedups
arise  in  each  case  shows  that  even  some  problems  whose  computations  are  ordinarily  assumed  to  be 
communication-bound  benefit  from  I-cache. 

The simulation results show that I-cache makes good use of chip area in the PE chip. Therefore,
scalable data-parallel applications for which it is paramount to make good use of chip area demand
I-cache.  The  observed  I-cache  speedups,  along  with  the  modest  estimated  chip-area  cost  of  I-cache, 
suggest  that  an  I-cached  SIMD  computer  would  exhibit  the  highest  throughput  of  any  programmable 
multiprocessor for scalable data-parallel problems.

Chapter 2

The  SIMD  Instruction  Cache  Idea 


In SIMD computers, maximum PE instruction execution rate is higher than global instruction broadcast
rate. The ratio between these rates is denoted pb. The alternatives available for achieving maximum
throughput when pb > 1 include SIMD instruction cache, or I-cache. I-cache in this context
is  new.  Its  function,  and  therefore  its  design,  are  subtly  different  from  those  of  the  more  familiar 
caches  that  are  used  in  microprocessors. 

By  presenting  some  of  the  main  hardware  and  control  choices  for  I-cache,  this  chapter  sketches 
the  I-cache  design  space.  This  chapter  concludes  with  an  example  of  how  I-cache  is  used  and  how  it 
impacts  the  throughput  of  a  SIMD  computation. 


2.1  Instruction  Delivery  in  SIMD  Computers 

The  PE  of  a  SIMD  computer  contains  only  the  minimum  circuitry  needed  to  operate  its  data  storage 
and  function  units  under  control  of  the  single  instruction  stream  which  it  shares  with  the  other  PEs. 
Conceptually,  a  SIMD  PE  may  be  thought  of  as  a  MIMD  PE  whose  program-control  components  that 
store,  fetch,  and  decode  instructions  have  been  removed. 

On one hand, the SIMD PE’s simplicity means that it occupies minimum chip area. Minimum
chip area per PE allows the greatest number of PEs to be packed into a given total chip area. The
densest packing of PEs should translate into maximum performance per hardware cost for scalable
data-parallel problems.

On  the  other  hand,  the  SIMD  PE’s  simplicity  introduces  a  new  limitation  in  the  rate  at  which 
instructions  are  executed.  The  simplified  PE  does  not  perform  its  own  program  control.  Instead, 
the  system  controller  provides  a  new  machine  code  instruction  on  each  cycle  of  the  system  clock. 
The limitation arises when the rate at which instructions are provided from afar is lower than the
maximum  rate  of  PE  calculation. 

The  SIMD  PE’s  simplicity  allows  many  PEs  to  be  packaged  within  a  PE  chip  of  modest  size  using 
current  VLSI  implementation  technique.  The  system  controller  in  a  SIMD  computer  performs  the 
program-control function once for all PEs, providing a fresh instruction to the PEs on each cycle of
the system clock. This arrangement is sketched in Figure 2.1.

PE  calculation  takes  place  entirely  within  the  confines  of  the  PE  chip.  By  contrast,  the  electrical 
pathway carrying instructions from the system controller must cross between chips at least once
before  it  arrives  at  any  PE. 

The set of wires through which instructions are delivered to the PE chips from the system
controller is the global instruction broadcast network. Let R denote the highest rate at which the
PEs within the PE chip can execute instructions. Let B denote the rate at which instructions are
delivered to the PE chips through the global instruction broadcast network. Then $\rho_b = R/B$ is
the ratio of these two rates. That is, $\rho_b$ denotes the factor by which the highest rate of PE
operation exceeds the rate of global instruction broadcast.

Figure 2.1: The system controller provides a system clock and broadcasts a new instruction on each
clock cycle.

In existing SIMD computers, $\rho_b = 1$ because $R = B$. This parity between R and B arises in existing
SIMD computers not because global instruction broadcast networks are carefully engineered (so that
B is high) but rather because PEs are artificially slow (making R low). As a typical example, consider
CM-2, a well-known SIMD computer whose PE contains a bit-serial function unit [22]. The system
clock rate in CM-2 is about 8MHz. For comparison, chips of complexity greater than that of the
CM-2 PE have been fabricated using similar VLSI process technology that run at clock rates as high
as 100MHz [86]. This disparity is a factor of about 12. Other SIMD computers whose PE chips operate
surprisingly slowly include VASTOR (2MHz) [87], CLIP7A (5MHz) [32], AIS-5000 (7MHz) [68], AAP2
(10MHz) [53], CAAPP (10MHz) [73], MP-1 (14MHz) [62], and Blitzen (20MHz) [36].
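
As a rough numerical illustration of the CM-2 comparison, the following Python sketch computes the ratio of a chip clock rate to the broadcast rate. The function name `rho_b` is hypothetical, and treating the 100MHz figure as a stand-in for R is an assumption made only for illustration; the text gives just the two clock rates.

```python
def rho_b(pe_rate_hz: float, broadcast_rate_hz: float) -> float:
    """Ratio of the highest PE instruction rate R to the broadcast rate B."""
    if broadcast_rate_hz <= 0:
        raise ValueError("broadcast rate must be positive")
    return pe_rate_hz / broadcast_rate_hz


# Assumption: use 100MHz (comparable chips in similar technology) as R
# and the roughly 8MHz CM-2 system clock as B.
cm2_disparity = rho_b(100e6, 8e6)
print(cm2_disparity)  # 12.5
```

When R equals B, as in existing SIMD computers, the same function returns 1.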

One ostensible reason for making R artificially low might be to reduce the cost of a PE chip. For
example, it was possible to use relatively inexpensive external memory chips with the slow PE chips
in CM-2 at no performance penalty. Unfortunately, an economic rationale for making R artificially
low is inconsistent with the design objectives of SIMD computer architecture. Making best use of
chip area is paramount for a relatively small subset of all computational problems, so PE chips for
SIMD computers are produced in relatively small quantities. Without economies of scale, SIMD PE
chips are relatively costly per part. Given that the monetary cost of a PE chip is high, saving cost by
under-designing the PE chip or by using inexpensive external memories is not a good reason for R to
be as low as B. If monetary cost is the prevalent concern, then SIMD computer architecture is not
likely to be an attractive choice. However, where maximum throughput is desired for the available
total chip area, or where minimum total chip area is desired in achieving a given throughput target,
SIMD computer architecture is a compelling alternative, so long as R is not unnecessarily low.

Making the PE chip as fast as possible, such that R is maximum for the PE architecture and for the
VLSI implementation technique, presents design challenges for system integration. While $\rho_b$ need
not be as large as the factors around 10 suggested by the operation rates and VLSI implementation
techniques of existing SIMD computers, it may be the case that even using high-speed interconnect
techniques to globally broadcast instructions, $\rho_b$ is yet greater than 1, perhaps as high as 2 or 3.
Fast instruction broadcast requires careful engineering of the global instruction broadcast network,
and it reduces the flexibility of the system with respect to scaling and geometric re-arrangement of
the PEs. Therefore, higher values of $\rho_b$ are associated with SIMD computers designed for scalability
at lower system re-integration costs. A further potential drawback of fast broadcast of instructions is
that it requires the system controller, which performs a potentially complicated control function, to
operate at high speed. Finally, the pin-limited nature of PE chips makes time-sharing of instruction
receiver pins attractive, if time-sharing can take place without compromising throughput. Compromise
in respect of any of these factors leads to large values of $\rho_b$.

2.2  Overcoming  Slow  Instruction  Delivery 

If the PEs are to perform calculations at the highest possible rates, it may be possible to bring B
into parity with R through careful electrical engineering of the global instruction broadcast network.
However, if it is impractical to do this, then $\rho_b$ is larger than 1. This possibility leads the computer
designer to wonder: What are the architectural alternatives for overcoming the instruction execution
rate limitation that arises when $\rho_b$ is significantly greater than 1? Is it possible to take advantage of
a value of R that is greater than B, or must R be made artificially low?

One option is to make the PE chips microprogrammed. With microprogramming, globally broadcast
instructions encode sequences of single-cycle PE operations. A microcontroller inside the PE
chip sequences each globally broadcast instruction’s microprogram for the PEs within the chip.

The global broadcast of microcoded instructions has been applied in limited ways in existing SIMD
computers. For example, multiply and divide operations in SLAP are controlled using a pair of
globally broadcast instructions, one to initiate and one to terminate an arithmetic sequence [27] (p. 373).
The PE instructions carrying out the arithmetic sequence are provided from within the PE chip.
This design choice was made for SLAP to free up the PE chip’s instruction pins so that inter-chip
communication operations could be controlled concurrently with multiplication or division. The PE
chip in SLAP is clocked at the rate of global instruction broadcast.

CM-2 incorporates microprogramming in a limited way. Driven by a high-level language program,
the CM-2 front end generates complex instructions that are decomposed into single-clock-cycle
operations for the PEs by a microcoded sequencer. This design choice was made for CM-2 to simplify
the computer’s high-level programming interface. The CM-2 sequencer is not integrated in the PE
chip, so it could not help alleviate the consequences of a large value of $\rho_b$.

An  alternative  to  microprogramming  the  PEs  is  SIMD  instruction  cache.  SIMD  instruction  cache, 
or  I-cache,  exploits  temporal  locality  in  the  broadcast  instruction  stream.  The  I-cache  is  a  buffer 
within  the  PE  chip  that  stores  instruction  sequences  that  are  identified  as  being  repeated.  Repetitions 
of  a  stored  sequence  are  subsequently  delivered  to  the  PEs  at  the  highest  rate  of  PE  operation.  Each 
PE  does  not  necessarily  need  its  own  I-cache;  one  I-cache  is  able  to  provide  instructions  to  the 
collection  of  PEs  located  in  a  PE  chip. 

I-cache  and  microcoding  both  involve  beefing  up  the  program  control  provided  within  the  PE  chip. 
Comparing  these  two  alternatives,  it  is  apparent  that  I-cache  suffers  the  relative  disadvantage  that 
instruction  sequences  must  first  be  stored  at  the  slow  broadcast  rate  B  before  they  can  be  retrieved 
at  the  high  chip  rate  R.  An  inherent  advantage  of  I-cache  where  chip  area  is  limited  is  that  only  the 
instruction  sequences  needed  for  a  given  computation  are  stored  in  an  I-cache.  By  contrast,  a  set  of 
microprograms  committed  to  ROM  in  the  PE  chip  may  be  large  and,  at  best,  difficult  to  modify. 


2.3  A  New  Use  of  the  Term  “Instruction  Cache” 

A SIMD instruction cache differs in a number of important ways from the direct-mapped instruction
caches typically used in microprocessors. This section highlights the main differences.




The  principal  difference  between  conventional  instruction  cache  and  SIMD  I-cache  is  that  SIMD 
I-cache  requires  explicit  control.  The  presence  of  ordinary  instruction  cache  is  not  apparent  in 
programs  and  involves  no  change  to  the  instruction  set  architecture  of  the  computer.  A  SIMD  I-cache, 
by  contrast,  is  explicitly  managed  in  programs  through  the  use  of  a  small  number  of  cache-control 
instructions added to the set of globally broadcastable instructions. A SIMD I-cache operates under
the  programmer's  control  in  a  pre-determined  manner,  whereas  an  ordinary  instruction  cache  exploits 
temporal  locality  in  an  instruction  stream  opportunistically  without  the  programmer's  intervention. 

Instructions are read from an ordinary cache on an individual basis to satisfy cache hits, and
instructions  are  stored  in  an  ordinary  cache  as  fixed-size  blocks  (or  lines)  to  exploit  expected  spatial 
locality  in  an  instruction  stream.  A  SIMD  I-cache  is  not  read  for  individual  instructions,  nor  are 
instructions  written  into  it  in  fixed-size  lines.  Rather,  a  SIMD  I-cache  stores  instruction  sequences 
of  varying  lengths.  The  length  of  a  SIMD  I-cache  block  matches  the  length  of  the  corresponding 
repeated  sequence  of  instructions.  For  example,  the  body  of  an  inner  loop  is  managed  as  a  single 
block  in  a  SIMD  I-cache. 

SIMD I-cache is managed differently from ordinary instruction cache. Whereas with an ordinary
cache comes hardware to determine dynamically whether each successive instruction reference may
be satisfied from cache, the SIMD I-caches discussed herein use no such hardware. Rather, these
SIMD I-caches are managed statically, by the programmer, perhaps in conjunction with a compiler.
This characteristic is not inherent, but it is appropriate for demonstrating the concept of SIMD
I-cache with programs whose flow-graph structures are easily statically analyzed. From the point
of  view  of  how  the  cache  memory  is  managed,  static  management  of  SIMD  I-cache  resembles  the 
well-known  compile-time  problem  of  register  allocation. 

A  microprocessor  contains  a  program-control  component  that,  among  other  things,  generates 
instruction  memory  references.  Those  instruction  memory  references  are  sped  up  through  the  use 
of  ordinary  instruction  cache.  By  contrast,  a  SIMD  PE  chip  does  not  generate  instruction  memory 
references, because it instead passively receives a sequence of globally broadcast instructions.
Therefore, instructions are placed explicitly in a SIMD I-cache, under the direction of globally
broadcast instructions. Subsequently, the PE chip’s local controller is directed explicitly to
retrieve a stored sequence of instructions from I-cache for execution by the PEs.

The  difference  between  SIMD  I-cache  and  ordinary  instruction  cache  can  be  summed  up  by 
observing that while some accesses to an ordinary instruction cache miss, all accesses to a SIMD
I-cache hit. If a needed instruction is not present in I-cache, then it is globally broadcast as it would
have been in a SIMD computer without I-cache. I-cache does not change the lock-step nature of
instruction  execution  among  the  PEs  of  a  SIMD  computer.  All  PEs  still  receive  a  common  sequence 
of  instructions,  but  a  single  I-cache  in  the  PE  chip  makes  it  possible  to  provide  repeat  instructions 
at  the  highest  rate  to  the  collection  of  PEs  within  the  chip. 

It  is  now  evident  that  there  are  many  differences  between  a  SIMD  I-cache  and  the  architectural 
component  ordinarily  connoted  by  the  term  “cache”.  Despite  these  differences,  the  term  is  used  here 
to  refer  to  PE  chip  instruction  buffers  because  in  its  most  general  sense,  “cache”  means  a  relatively 
small,  fast  repository  used  to  speed  up  computation.  The  reader  need  only  bear  in  mind  that  instead 
of circumventing redundant slow accesses to some larger repository of instructions, SIMD I-cache is
used  to  eliminate  redundant  instruction  broadcasts  through  a  slow  network. 


2.4  I-Cache  Design  Parameters 

Just  what  does  a  SIMD  instruction  cache  look  like?  This  section  outlines  the  I-cache  design  space  by 
identifying four major physical design choices.


1. Number of cache blocks: An I-cache may contain a single block at a time, or it may contain
multiple  blocks. 

A  single-block  I-cache  is  simplest  with  respect  to  delimiting  blocks  in  cache  memory  and  also 
with  respect  to  managing  the  available  memory,  because  there  is  a  unique  starting  address  for 
any  cache  block.  For  a  multi-block  I-cache,  the  globally  broadcast  instruction  activating  a  cache 
block  must  specify  the  starting  address  of  the  block  in  cache. 

Containing  more  than  one  block  at  a  time  is  advantageous  for  computations  wherein  a  number 
of  repeated  instruction  sequences  alternate  over  the  course  of  the  program’s  execution.  With 
a  single-block  I-cache,  the  alternating  sequences  must  be  re-stored  in  I-cache  prior  to  each 
use,  because  the  alternating  sequences  displace  one  another.  Such  re-storing  of  instruction 
sequences  in  cache  is  a  form  of  thrashing  that  reduces  I-cache  effectiveness. 

2.  Number  of  iterations  per  cache  block  activation:  An  I-cache  controller  may  sequence  single 
passes through a cache block, or it may be capable of sequencing multiple passes in response to
a  global  broadcast  instruction  activating  the  cache  block. 

A  single-pass  I-cache  is  simplest  with  respect  to  managing  iterations.  A  multi-pass  I-cache 
variant requires an equivalent of an iteration counter. A multi-pass I-cache variant might, for
example,  be  one  in  which  loop-control  instructions  may  be  placed  in  cache. 

A  multi-pass  I-cache  is  advantageous  for  instruction  sequences  whose  duration  is  not  an  integer 
multiple of $\rho_b$. After each pass through such a sequence, a single-pass I-cache variant awaits
the  next  globally  broadcast  instruction  to  arrive.  A  multi-pass  I-cache  variant,  by  contrast, 
does  not  wait  after  completing  one  iteration;  rather,  it  begins  sequencing  the  next  iteration 
immediately. 

3. Number of cache memory ports: Cache memory may have a single port, or it may have more
ports. 

Single-port  memories  are  electrically  simpler  to  design  than  multi-port  memories,  and  their 
cells  lay  out  more  compactly  than  do  multi-port  memory  cells. 

Using  a  two-port  cache  memory  in  conjunction  with  a  multi-block  I-cache  variant,  it  is  possible 
to  pre-store  one  cache  block  while  another  cache  block  is  active.  Using  a  second  cache  memory 
port  in  this  way  is  a  form  of  prefetching  that  minimizes  the  time  spent  idly  by  the  PEs  waiting 
for  the  next  block  to  be  placed  in  cache. 

4. Nesting of cache blocks: A multi-block I-cache variant may or may not allow cache blocks to
activate  one  another. 

An  I-cache  without  nesting  is  simpler,  because  no  cache  block  execution  stack  is  needed. 

Allowing  cache  block  activation  instructions  to  be  placed  in  cache  memory  means  that  arbitrarily 
complex  loop  structures  may  be  cached  entirely.  Once  an  entire  program  is  stored  in  cache,  the 
PE  chips  need  not  wait  for  globally  broadcast  instructions  over  the  course  of  the  computation. 

A  PE  chip  containing  an  I-cache  variant  of  this  complexity  may  be  viewed  as  a  SIMD  computer  in 
its  own  right,  because  the  program  stored  in  cache  which  controls  the  collection  of  PEs  in  the  PE 
chip  may  execute  independently  of  the  other  PE  chips  in  the  computer.  SIMD  computers  with 
multiple  program-control  units  are  called  Multi-SIMD  (or  MSIMD)  computers,  as  exemplified 
in  GPA  [10]. 

This dissertation evaluates in detail two of the simplest I-cache variants, F0 and F2. F0 is the
simplest I-cache variant, containing one block at a time that is executed in single passes. F0 is a
“one-block, one-shot” I-cache variant. Of course, F0 has only one port and the question of cache block
nesting is moot. F2 is almost identical to F0, with the addition of the ability to execute multiple
iterations of a cache block from a single globally broadcast activation instruction. F2 is a “one-block,
multi-shot” I-cache variant.
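
The difference between the one-shot and multi-shot variants can be illustrated with a small timing model. The Python sketch below is not taken from the evaluation in this dissertation. It assumes that each cached instruction takes one PE clock cycle, that the ratio of PE rate to broadcast rate is an integer number of PE cycles per system clock cycle, that a one-shot cache waits for the next system clock edge after every pass, and that activation overhead is negligible. All function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def one_shot_cycles(seq_len: int, passes: int, rho_b: int) -> int:
    """System clock cycles when every pass needs its own broadcast activation.

    Each pass is rounded up to a whole number of system clock cycles,
    because the next activation arrives only on a system clock edge.
    """
    return passes * ceil_div(seq_len, rho_b)


def multi_shot_cycles(seq_len: int, passes: int, rho_b: int) -> int:
    """System clock cycles when one activation sequences all passes.

    Passes run back to back, so only the final pass is rounded up.
    """
    return ceil_div(passes * seq_len, rho_b)


# A 5-instruction block, 3 PE cycles per system clock cycle, 6 passes.
print(one_shot_cycles(5, 6, 3))    # 12
print(multi_shot_cycles(5, 6, 3))  # 10
```

Under these assumptions the two variants take the same time whenever the block length is an integer multiple of the rate ratio, and the multi-shot variant is ahead otherwise.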


2.5  Management  of  I-Cache 

All  components  of  a  SIMD  computer  are  centrally  controlled  (by  the  system  controller),  and  I-cache 
is no different. Although each PE chip has its own I-cache, the states of all of them are identical.
The I-cache replicated in each PE chip is logically redundant, but it serves to remove redundantly
repeated instructions from the global broadcast instruction stream. With I-cache, those repeated
instructions may instead be delivered to the PEs from the fast on-chip repository within the PE chip.

I-cache is managed by the system controller using globally broadcast cache-control instructions.
The local controller within the PE chip is directed when to begin storing instructions in cache and
when to stop storing instructions there. Subsequently, the system controller instructs the local controller
to  begin  executing  a  cache  block  by  providing  the  parameters  needed  to  activate  the  cache  block. 

Management of SIMD instruction cache is performed either statically by a programmer or compiler,
or dynamically by a cache management algorithm running on the system controller. Static
cache management occurs prior to execution of a program. Static management is accomplished by
modifying the program controlling the computation that is loaded into the system controller before
the outset of the computation. Static management is sometimes useful also for ordinary instruction
caches; a compiler that statically re-arranges instructions in memory to increase the efficiency of
uniprocessor caches is reported in [57]. The static management of I-cache is reminiscent of the
overlaying used in programs to manage the small main memories of early computer systems including
the DEC PDP-11 and the IBM 360-370 [11] (p. 15). Dynamic I-cache management amounts to a case
of the well-known memory management problem that arises, for example, in implementing virtual
memory [72] (Sec. 5.2).

Whether  I-cache  is  managed  statically  or  dynamically,  and  however  complex  the  I-cache  design 
variant,  there  is  a  single  set  of  cache-management  sub-problems  that  are  solved  in  all  cases.  These 
sub-problems  are: 

1. identifying the cachable instruction sequences,

2.  determining  which  sequences  are  stored  in  cache, 

3.  determining  where  in  cache  to  store  cache  blocks, 

4.  scheduling  cache  blocks  appropriately, 

5.  directing  the  storing  of  the  scheduled  cache  blocks  in  cache  prior  to  their  use, 

6.  and  activating  stored  cache  blocks  at  the  appropriate  points  in  the  computation. 

These sub-problems are considered in more detail later (in Section 4.5). Although the I-caches
evaluated herein are managed statically, this has been done for simplicity in the system controller.
The evaluations are performed for sample problems for which the best use of I-cache is
straightforward. For programs whose flow graphs are difficult to analyze well statically, the system
controller should contain a unit that performs dynamic I-cache management. Such an I-cache
management unit would maintain a model of which instructions are stored where in I-cache. Before
globally broadcasting a cachable instruction sequence, the I-cache management unit would be consulted
to determine whether the sequence is already in cache. If so, the sequence would be executed from
cache. If not, the decision would be made at that point whether to place the sequence in cache, and
where. The examples of static I-cache management given in this dissertation are indicative of how
those decisions are made.
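
The consult-then-decide behavior described for a dynamic I-cache management unit can be sketched in Python for the simplest case of a one-block cache. This is only an illustration under stated assumptions: the class `OneBlockCacheModel`, the marker names `STORE_BEGIN`, `STORE_END`, and `ACTIVATE`, and the caller-supplied `worth_caching` flag are all hypothetical and do not come from the dissertation. The sketch also assumes that storing a sequence does not execute it, so a freshly stored block is activated immediately afterward.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

InstrSeq = Tuple[str, ...]  # a cachable instruction sequence, e.g. a loop body


@dataclass
class OneBlockCacheModel:
    """Hypothetical controller-side model of what a one-block I-cache holds."""
    capacity: int                          # instructions the cache can hold
    cached: Optional[InstrSeq] = None      # sequence currently in cache
    stream: List[str] = field(default_factory=list)  # what gets broadcast

    def run(self, seq: InstrSeq, worth_caching: bool) -> None:
        """Record the broadcasts needed to execute `seq` once on the PEs."""
        if seq == self.cached:
            # Already stored: one activation replaces the whole sequence.
            self.stream.append("ACTIVATE")
        elif worth_caching and len(seq) <= self.capacity:
            # Not stored: store it (displacing any old block), then run it.
            self.stream += ["STORE_BEGIN", *seq, "STORE_END", "ACTIVATE"]
            self.cached = seq
        else:
            # Not cached: broadcast as in a SIMD computer without I-cache.
            self.stream += list(seq)


model = OneBlockCacheModel(capacity=8)
body = ("ld", "add", "st")
for _ in range(3):
    model.run(body, worth_caching=True)
print(model.stream)
```

In this sketch the loop body crosses the broadcast network once; the two later iterations each cost a single activation.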


2.6  An  Example  of  I-Cache  Use 

This  section  presents  a  simple  example  of  how  I-cache  is  used.  For  an  industrial-strength  example 
for  an  assembly  language,  the  reader  should  consult  Appendix  C. 

Consider the skeleton for program simplest, consisting only of a simple loop:

    program simplest;
        for j = 1 to J do
            A
        end;
    end simplest;

The symbol A in program simplest denotes the sequence of instructions that is the loop body.
Assuming  that  A  is  a  cachable  instruction  sequence,  the  problems  of  determining  which  sequence  to 
store  in  cache,  when  to  store  it,  and  when  to  activate  it  have  obvious  solutions. 

Recall that an F0 I-cache is the simplest I-cache variant, capable of storing only one cache block at
a time and executing only single passes through the stored block. The following skeleton for program
simplest_cache illustrates how program simplest is modified to use an F0 I-cache:

    program simplest_cache;
        store sequence A in cache
        for j = 1 to J do
            activate cached sequence A
        end;
    end simplest_cache;

Ideally, each pass through cached sequence A in simplest_cache is $\rho_b$ times faster than each
corresponding iteration of the inner loop in simplest. However, storing A in cache is an extra pass
through that sequence of instructions in simplest_cache that has no counterpart in simplest.


2.7  I-Cache  Speedup 

In  general,  the  benefit  of  using  I-cache  for  a  program  can  be  measured  directly  by  comparing  the 
execution  time  of  the  program  on  a  baseline  generic  SIMD  computer  against  that  of  a  version  modified 
to  run  on  a  SIMD  computer  with  I-cache.  The  ratio  of  these  two  times  gives  the  speedup  due  to  I-cache 
for  the  subject  program. 

A convenient unit to measure the time taken to run a SIMD computation is the number of cycles of
the system clock. For example, the time to run program simplest above is given as

$$\text{time for simplest} = J \cdot |A| \ \text{cycles}$$

where $|A|$ denotes the number of system clock cycles taken for instruction sequence A.
Assuming that instruction sequence A is cachable and that it runs faster from cache by the factor
$\rho_b$, the time for the modified program to run with I-cache is given as

$$\text{time for simplest\_cache} = |A| + J \cdot \frac{|A|}{\rho_b} \ \text{cycles}$$
The I-cache speedup is given as the ratio of these two execution times:

$$\text{speedup for simplest} = \frac{J \cdot |A|}{|A| + J \cdot \frac{|A|}{\rho_b}} = \frac{J}{1 + \frac{J}{\rho_b}} \tag{2.1}$$

$$\text{speedup for simplest} = \frac{J \cdot \rho_b}{J + \rho_b} \tag{2.2}$$

Equation 2.2 suggests that the speedup for program simplest approaches $\rho_b$ as J approaches ∞.
At the other extreme, if sequence A is executed just once (so that $J = 1$), then the speedup is less than
1. While I-cache makes it possible to speed up a computation when $\rho_b$ is large, if used inappropriately
then I-cache can actually slow down a computation. This possibility arises because the time needed
to store an instruction sequence in a simple I-cache is not negligible.
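
The two extremes can be checked numerically. The Python sketch below simply evaluates the two execution-time expressions of this section (J passes at full broadcast cost, versus one storing pass plus J passes sped up by the rate ratio). The function and parameter names are hypothetical, and the value 4 for the rate ratio is an arbitrary choice for illustration.

```python
def time_simplest(seq_cycles: float, iterations: int) -> float:
    """System clock cycles for J broadcast passes through sequence A."""
    return iterations * seq_cycles


def time_simplest_cache(seq_cycles: float, iterations: int, rho_b: float) -> float:
    """One storing pass at broadcast rate, then J passes from cache."""
    return seq_cycles + iterations * seq_cycles / rho_b


def icache_speedup(iterations: int, rho_b: float, seq_cycles: float = 1.0) -> float:
    """Ratio of the uncached to the cached execution time."""
    return (time_simplest(seq_cycles, iterations)
            / time_simplest_cache(seq_cycles, iterations, rho_b))


for j in (1, 4, 100, 10_000):
    print(j, round(icache_speedup(j, rho_b=4), 3))
```

With a rate ratio of 4, a single iteration gives a speedup of 0.8 (a slowdown), four iterations give exactly 2, and the value creeps toward 4 as the iteration count grows. The length of A cancels out of the ratio, so `seq_cycles` does not affect the result.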

Equation 2.1 points up an analogy between I-cache speedup and speedup due to parallelism.
Recall that a P-PE multiprocessor achieves up to P times the throughput of its uniprocessor
counterpart, although in practice multiprocessor speedup is usually considerably less than that
limit [78]. In general, C of the N instructions executed in the uniprocessor computation are
inherently sequential, such that they cannot be run faster through parallel execution. The
multiprocessor speedup limit is given in Equation 2.3:

$$\text{speedup due to parallelism} \le \frac{N}{C + \frac{N - C}{P}} \tag{2.3}$$


Amdahl’s law is the observation that multiprocessor speedup cannot exceed $N/C$ irrespective of
the number of PEs in the multiprocessor. For example, if 90% of the instructions executed on a
uniprocessor execute P times faster on a P-PE multiprocessor while the rest of the instructions
execute no faster on the multiprocessor (such that $N/C = 10$), then the speedup from parallelism
cannot exceed 10. In this example, for $P = 1000$, the actual speedup is just over 9.9.
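
The figures in this example are easy to reproduce. The Python sketch below evaluates the standard form of the multiprocessor speedup limit, with N total instructions of which C are inherently sequential and P processing elements. The function name is hypothetical, and N = 100 is an arbitrary scale chosen so that C = 10 represents the sequential 10%.

```python
def parallel_speedup_limit(n_total: float, n_sequential: float, num_pes: int) -> float:
    """Upper bound on speedup when only the non-sequential part scales with P."""
    parallel_part = n_total - n_sequential
    return n_total / (n_sequential + parallel_part / num_pes)


# 90% of the instructions parallelize perfectly, 10% do not.
limit_1000 = parallel_speedup_limit(100, 10, 1000)
print(round(limit_1000, 4))  # 9.9108
```

However large P becomes, the result stays below N/C = 10, and with a single PE it reduces to 1.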

The analogy with Amdahl’s law for I-cache speedup is that speedup cannot exceed J, irrespective
of the value of $\rho_b$. In an I-cached SIMD computation, some instructions must be globally broadcast,
for example to store them in I-cache as in program simplest_cache above. Just as the actual
speedup from multiprocessing is usually less than the number of PEs P, the actual I-cache speedup
is usually less than the limit of $\rho_b$, even for ideally suitable problems. This limiting effect of J is
evident in the small speedups for simplest for low values of J in Figure 2.2.
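
Both limits follow directly from the ratio of the two execution times for the simple loop. As a short worked step, assuming only those two time expressions and $J \ge 1$, $\rho_b \ge 1$, dividing the numerator and denominator of the ratio first by $|A|$ and then by $J/\rho_b$ gives two equivalent forms:

```latex
\[
\text{speedup} \;=\; \frac{J}{1 + J/\rho_b} \;<\; J,
\qquad
\text{speedup} \;=\; \frac{\rho_b}{1 + \rho_b/J} \;<\; \rho_b ,
\]
\[
\text{so that}\quad \text{speedup} \;<\; \min(J,\ \rho_b).
\]
```

The first form shows the bound set by the iteration count, and the second shows the bound set by the rate ratio.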

This analogy with Amdahl’s law shows the dangers in assuming naively that the benefit from
I-cache is proportional to $\rho_b$. In fact, the interactions among I-cache design, system design, and
program  properties  are  complex.  The  difficulty  in  ascertaining  a  priori  the  impact  of  I-cache  on 
the  throughput  and  chip  area  of  a  given  SIMD  computation  necessitates  the  empirical  evaluation  of 
I-cache  variants  that  is  the  focus  of  this  dissertation. 




Chapter  3 

SIMD  Computer  Implementation 


I-cache  is  an  architectural  element  added  to  the  PE  chips  of  a  SIMD  computer  to  increase  the  rate 
at  which  repeated  instructions  are  provided  to  the  PEs.  So  I-cache  changes  the  physical  structure 
of the computer and the time required to perform a computation. Because it is explicitly managed
by broadcast instructions, I-cache also changes the logical structure of a SIMD computation. To be
able to assess the impact of these many changes, analysis of I-cache requires a model of the SIMD
computer  that  encompasses  both  the  physical  structure  of  the  machine  and  the  logical  structure  of 
the  computations  it  performs. 

The SIMD computers produced over the years have been designed under continually changing
technological constraints with a diversity of specific application targets. The great variety in PE
architectures,  sizes,  interconnection  topologies,  and  engineering  cost  relationships  frustrates  the 
goal  of  describing  all  SIMD  computers  with  a  single,  universal  model. 

Nonetheless, it is possible to identify the salient elements that differentiate a SIMD computer
from any other kind of multiprocessor. Furthermore, it is possible to combine those elements in a
model that is parameterized to capture a broad range of implementation alternatives.

This  chapter  describes  the  model  of  generic  SIMD  computation  underlying  the  analysis  of  I-cache. 
The model highlights the PE, whose replication in large numbers at least cost is the main objective
of  SIMD  computer  architecture.  An  example  illustrates  how  the  SIMD  computer's  components  carry 
out  computation. 

The  SIMD  computer's  subsystems  that  move  information  between  chips  (multi-chip  subsystems, 
or MCSs) are represented in the model in the same way as PE function units (or FUs). This uniformity
of  representation  facilitates  writing  assembly  language  programs  that  describe  SIMD  computations 
as  sequences  of  FU  and  MCS  operations  without  regard  to  the  detailed  implementation  of  these 
sometimes  complicated  components. 

The  SIMD  computer  executes  machine  code  programs  which  specify  the  clock  cycle  by  clock  cycle 
activity.  Assembly  language  programs  are  drawn  on  a  sequential  model  wherein  each  instruction’s 
execution  completes  before  the  next  instruction’s  execution  begins.  Machine  code  programs  reflect 
the  physical  characteristics  of  the  computer,  wherein  the  length  of  time  required  to  complete  an 
instruction  depends  on  the  operation  it  specifies,  and  wherein  mutually  independent  operations  are 
performed concurrently, resources permitting.

The  most  important  aspects  of  this  model  of  SIMD  computation  reflect  the  characteristics  of 
VLSI-based  implementation.  The  PE  chips  contain  many  PEs  whose  FUs  can  operate  at  the  high 
rates attainable within chips. The MCSs, for example those through which the PEs inter-communicate
and access memory external to the PE chip, contain wires that run between chips. Inter-chip
wires typically present relatively large capacitances and long distances, and circuits containing them
are typically slower than intra-chip circuits. Therefore, MCSs typically operate more slowly than PE FUs.
The global instruction broadcast network is the MCS whose operation rate relative to that of the PE
is crucial to the value of I-cache. Also important are the operation rates of the other MCSs; in the
model, these rates are parametric relative to the PE clock rate.

The model allows I-cache to be described as a detailed change to the structure of a SIMD computation.
The model also provides a basis for evaluating the throughput impact of I-cache.


3.1  Generic  SIMD  Computer

Figure 3.1 depicts a SIMD computer. The system controller generates a system clock that regulates
all elements. The system controller also sequences instructions and broadcasts them via a global
instruction broadcast network to an array of PE building blocks. Other elements of the computer,
including an inter-PE communication subsystem, a data I/O subsystem, and a response subsystem,
each comprise one or more chips connected through inter-chip wires to at least one PE building
block, and each is therefore a multi-chip subsystem, or MCS. The topology of the inter-PE
communication network is a principal discriminator among SIMD computers; existing SIMD computers
contain inter-PE communication network topologies ranging from linear [27] to grid [4, 30, 36] to
multi-stage permutation [6, 8] to hypercube [42].

The system controller consists of a sequencer and a mechanism for evaluating loop-index-dependent
expressions and providing the resulting literal values to the PEs. The system controller provides
the program-control functions required in a SIMD computer. The system controller is here assumed
to use a small set of single-cycle operations. An actual system controller may be optimized for the specific
computations being performed on a given computer, as is the case, for example, in SLAP [28]. The
program-control functions performed by the system controller may be complicated. The potentially
crucial topic of SIMD system controller design is beyond the scope of this work.

[Figure 3.2 labels: From Global Instruction Broadcast Network; To Inter-PE Communication Network; To Data I/O Network; To Response Network]

Figure 3.2: SIMD Computer Building Block

Figure 3.2 depicts a building block for a generic SIMD computer. The building block contains
a PE chip connected via inter-chip wires to memory chips. The memory chips are included in the
local external memory subsystem, an MCS realized within the building block to accommodate PE
data requirements exceeding the memory capacity within the PE chip. The PE chip contains a local
controller, a number of identical PEs, and control and pin access for the MCSs.

Within the PE chip, the local controller provides cycle-by-cycle instructions to the PEs via a local
instruction  broadcast  network.  The  local  controller  also  generates  the  clocks  regulating  the  PE  chip’s 
constituents;  in  a  generic  SIMD  computer,  there  is  only  one  such  clock,  standardized  to  the  system 
clock. Figure 3.3 depicts a local controller for a generic SIMD computer.

Figure 3.3: Generic SIMD Local Controller


3.2  Processing  Element 

The  SIMD  computer's  PEs  carry  the  brunt  of  the  computational  load.  The  PE  is  specialized  for 
performing  calculations,  with  the  program-control  functions  being  relegated  to  the  system  controller. 


[Figure 3.4 labels: From Local Instruction Broadcast Network; To PE chip control and pin access for multi-chip subsystems]

Figure 3.4: SIMD Processing Element

Figure  3.4  contains  a  sketch  of  the  PE.  The  calculation  component  is  essentially  the  same  as  the 
calculation  component  of  a  uniprocessor,  consisting  of  a  function  unit  (or  FU)  and  register  memory. 
Figure  3.4  indicates  that  the  PE  contains  interfaces  to  the  various  MCSs,  so  that  it  may  access 
external  memory,  communicate  with  other  PEs,  obtain  input  data  sets,  provide  output  data  sets,  and 
signal  data-dependent  conditions  to  the  system  controller. 

The  PE  also  contains  a  context  manager,  which  performs  a  limited  program-control  function. 
The  context  manager  allows  a  SIMD  computer  to  execute  data-dependent  programs,  wherein  the 
sequence of executed instructions depends on values of intermediate results. Such data dependence
arises, for example, from conditional IF-THEN-ELSE constructs in the program executed by the PEs.

The context manager maintains a one-bit control flag as directed by context management instructions
contained in the globally broadcast instruction stream. These context management instructions
conditionally set and clear the value of the control flag based on intermediate data values. When the
control flag is clear, the PE is said to be "awake" in the current context, and it executes instructions as
they are broadcast. However, when this control flag is set, the PE is said to be "asleep" in the current
context.  When  the  PE  is  asleep,  all  writes  to  PE  state,  including  registers  and  external  memory,  are 
inhibited.  Instructions  broadcast  when  the  PE  is  asleep  do  not  change  the  state  of  the  PE,  and  thus 
have  no  effect. 


[Figure 3.5 labels, top to bottom: Step 3; Step 2; Step 1]

Figure 3.5: An example of tree-summation with P=8. The PEs communicate in a tree pattern to
calculate the sum $\sum_{\pi=0}^{P-1} A[\pi]$. Time progresses upwards in the figure.

Conceptually, the context manager maintains a stack of contexts within the PE, pushing the
value of the control flag whenever a new context is determined. The context manager may be
implemented as an up/down counter with a small amount of associated control logic [27]. A context
manager occupies far less chip area than do a uniprocessor's program storage and program sequencing
elements.
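
The behaviour described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the circuit of [27]: the counter rule used here (count the contexts entered since the PE fell asleep, and treat a count of zero as "awake") is one plausible way to realize a context stack with an up/down counter, and the names `ContextManager`, `PE`, `sleep_depth`, `push`, `pop`, and `wake_all` are hypothetical.

```python
class ContextManager:
    """Up/down-counter sketch of a PE context manager (illustrative only)."""

    def __init__(self):
        # Assumed encoding: number of contexts entered since the PE fell
        # asleep. Zero means the control flag is clear (PE awake).
        self.sleep_depth = 0

    @property
    def awake(self):
        return self.sleep_depth == 0

    def push(self, condition):
        """Enter a new context; the PE stays awake only if it was awake
        and the data-dependent condition holds."""
        if self.sleep_depth > 0 or not condition:
            self.sleep_depth += 1

    def pop(self):
        """Revert to the previous context."""
        if self.sleep_depth > 0:
            self.sleep_depth -= 1

    def wake_all(self):
        """Unconditionally awaken the PE, discarding pushed contexts."""
        self.sleep_depth = 0


class PE:
    """A PE whose register writes are inhibited while it is asleep."""

    def __init__(self):
        self.ctx = ContextManager()
        self.regs = {}

    def write(self, reg, value):
        if self.ctx.awake:          # writes to PE state are inhibited when asleep
            self.regs[reg] = value
```

In this sketch a broadcast instruction reaches every PE, but only awake PEs change state, which mirrors the statement that instructions broadcast to a sleeping PE have no effect.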

3.3  An  Example  of  SIMD  Computation:  Tree-Summation

The following example illustrates what the PE does over the course of a SIMD computation. In the
computation described here, the PEs form the sum of an array of integers. This example is typical of
PE activity in a SIMD computation, because applying an associative operator to elements of an array
occurs often in data-parallel computations. The program-control activity of the system controller,
I/O, and the details of communication are not emphasized here, in favor of focusing on the PE.

In this example, the PEs are inter-connected by a routed inter-PE communication network. To
perform communication, each PE specifies the index of a target PE and a value to be sent to that
PE. The flexible inter-PE communication allows the summation to be performed in log2 P steps on P
PEs, where there are P integers to be added.

Figure  3.5  illustrates  the  successive  steps  of  the  tree-summation  for  the  case  where  P=8.  Initially, 
the  array  is  stored  one  element  per  PE.  Figure  3.5  shows  that  the  pattern  of  communicating  PEs 
forms  a  binary  tree.  At  each  leaf  of  the  tree  is  a  PE  storing  one  element  of  the  array.  During  each 
step  of  the  computation,  values  move  one  step  closer  to  the  root  of  the  tree.  Each  active  PE  sends  its 
value  to  its  parent,  and  the  parent  adds  together  the  values  it  receives.  The  parent  becomes  a  child 
on  the  next  step.  Figure  3.5  suggests  that  one  of  the  two  children  sending  a  value  to  a  parent  resides 
in  the  same  PE  as  the  parent.  A  PE  becomes  inactive  following  a  step  when  its  parent  in  the  tree 
resides  in  a  different  PE.  After  log2  P  steps,  the  root  of  the  tree  contains  the  sum  of  the  P  elements. 

The algorithm sketch in Figure 3.6 shows the sequence of operations performed by the PE. The
computation consists of a loop wherein each PE maintains a value, sum. sum is initialized in each PE
to contain its assigned array element. (Where the PE index is denoted π, sum in PE π is initialized
to contain array element A[π].) In the loop body, the PE determines the index of its current tree
parent and sends sum to the parent. If the PE is not itself a parent at this step, then it is asleep
for the remainder of the tree-summation. Otherwise, the PE remains awake and replaces sum with
the sum of its children's values. The algorithm sketched in Figure 3.6 is executed in lock-step on the
array of PEs.

    sum = A[π];
    targmask = -1;                  /* initialize target mask to all 1's */
    for (i=0; i<log2 P; i++)
    {
        targmask = targmask << 1;   /* shift target mask left one position */
        targ = π & targmask;
        if (targ == π)
            targ = -1;              /* do not send to self */
        rxval = route(sum, targ);   /* send sum to PE whose index is targ, where
                                       it is stored in rxval. */
        if (targ != -1)
            sleep;                  /* Once its sum has been sent to another PE,
                                       this PE becomes inactive. */
        sum = rxval + sum;          /* Active PE accumulates received value. */
    }
    wakeup PE;

Figure 3.6: The operational structure of the tree-summation loop. π denotes the PE index (π ∈
{0...P-1}). A[] is the array whose elements are added together, and A[π] resides in the memory
of PE π. sum, targmask, targ, and rxval are local PE variables. The final sum ends up in PE 0.

The loop control variable i is maintained in the system controller, not in the PEs, because the system
controller performs all of the program-control functions. The PE operations shown in Figure 3.6
are carried out using the components shown in Figure 3.4. The local PE variables used in the loop
body are kept in register memory. The first operation retrieves A[π] from local external memory
and places the value in sum. Local external memory is an MCS shown in Figure 3.2. If the variable
targmask is maintained in the PE, then the shift operation is performed in the FU. However, the
value of targmask is not data dependent and it has a common value in all PEs. Therefore, targmask
could be maintained once for all PEs by the system controller. In either case, the loop body begins
with FU operations to calculate targ, the index of the PE to which the current value of sum will
be sent. sum is then transmitted through the inter-PE communication network (if targ ≠ -1). The
inter-PE communication network is an MCS shown in Figure 3.1. After sending its sum to PE targ,
a PE is deactivated for the remainder of the computation. PEs that remain active are the tree parents
at the current step. An active PE receives a value from the inter-PE communication network in the
register memory variable rxval. At the end of the loop body, the FU adds rxval and sum, placing
the result back into register memory.

By  the  end  of  the  last  iteration  of  the  loop,  all  PEs  except  PE  0  are  asleep.  The  instruction 
following  the  loop  body  awakens  the  PEs,  removing  the  contexts  pushed  during  the  loop’s  execution. 
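
As a cross-check on the walk-through above, the loop of Figure 3.6 can be simulated in software. The following Python sketch is illustrative only and rests on several assumptions: P is a power of two, the routed network is modelled as a direct write into the target PE's rxval, and lock-step execution is modelled as two passes over the PEs per iteration. The function name `tree_sum` and the `awake` list are hypothetical.

```python
def tree_sum(a):
    """Simulate the tree-summation loop on P = len(a) PEs.

    Returns the value of sum in PE 0 and the awake/asleep state of each PE
    just before the final wakeup instruction.
    """
    p = len(a)
    if p == 0 or p & (p - 1):
        raise ValueError("P must be a positive power of two")
    steps = p.bit_length() - 1       # log2 P
    total = list(a)                  # sum in each PE, initialized to A[pi]
    awake = [True] * p
    targmask = -1                    # all 1's

    for _ in range(steps):           # loop index lives in the system controller
        targmask <<= 1
        targ = [-1] * p
        rxval = [0] * p
        # Pass 1: each awake PE computes its target and routes its sum.
        for pe in range(p):
            if not awake[pe]:
                continue
            t = pe & targmask
            targ[pe] = -1 if t == pe else t   # do not send to self
            if targ[pe] != -1:
                rxval[targ[pe]] = total[pe]
        # Pass 2: senders fall asleep; the remaining PEs accumulate.
        for pe in range(p):
            if not awake[pe]:
                continue
            if targ[pe] != -1:
                awake[pe] = False
            else:
                total[pe] += rxval[pe]
    return total[0], awake
```

For example, `tree_sum([1, 2, 3, 4, 5, 6, 7, 8])` leaves the sum in PE 0 after three iterations, with every other PE asleep, matching the statement that all PEs except PE 0 are asleep by the end of the last iteration.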


3.4  Representations  of  Multi-Chip  Subsystems 

Each MCS may comprise control and pin time-sharing sub-circuits within the PE chip, inter-chip
wires, and components contained in chips other than PE chips. The PE's interface to an MCS
consists of input and output registers containing data sent out and received through the subsystem.
The numbers of input and output registers in the PE for each MCS depend on the MCS function;
either (but not both) of these numbers is 0 in some cases.

The interface registers associated with each MCS make each MCS behave like the FU, from the
point of view of activity on the PE busses: "operands" are stored into input registers from busses,
a designated "operation" is performed, and at the completion of that operation a "result" may be
driven onto a bus from the output register. Figure 3.7 depicts this analogy between MCS and FU.
Whereas the FU performs arithmetic or logical calculation using an ALU contained in the PE chip, an
MCS performs local external memory access, inter-PE communication, system data memory access,
transmission of a logical value to the system controller, or delivery of a broadcast literal, using
circuits that include inter-chip wires.

The  abstraction  of  MCSs  as  FU  equivalents  simplifies  programming,  in  that  it  allows  an  assembly 
language  program  to  specify  MCS  activity  as  register-to-register  PE  operations,  without  regard  to 
the detailed implementation of the MCS.

    Variable    Register
    π           R0
    sum         R2
    targmask    R3
    targ        R4
    rxval       R1

Table 3.1: Register Assignments for Tree-Summation Loop Body


Table 3.1 shows a register assignment for the variables used in the tree-summation loop.
Figure 3.8 shows an assembly language program for the tree-summation loop using the register
assignments shown in Table 3.1. A LOAD operation is used to fetch A[π] from local external memory,
while a ROUTE operation specifies routed inter-PE communication. MCS operations resemble FU
operations in the code in Figure 3.8. The code in Figure 3.8 also demonstrates the context
management operations used to control conditional execution of instructions. (Assembly language syntax is
detailed in Appendix B.)


3.5  Constraints  Arising  from  VLSI  Implementation 

At  least  since  the  advent  of  microprocessors,  the  characteristics  of  VLSI  implementation  technique 
have  constrained  computer  throughput.  The  minimum  time  required  for  a  VLSI  circuit  to  change 
the  voltage  on  a  capacitor  through  a  wire  grows  with  both  the  driven  capacitance  and  the  length  of 
the  wire.  A  characteristic  of  VLSI  circuits  is  that  capacitances  tend  to  be  greater  and  wire  lengths 
tend  to  be  longer  between  chips  than  within  chips.  Therefore,  a  key  property  of  VLSI-based  systems 
is  that  intra-chip  circuits  tend  to  be  faster  than  multi-chip  circuits. 

A VLSI manufacturing process determines the maximum number of transistors that can fit into
a chip, although wires that inter-connect transistors typically cause the actual number of transistors
realized in a chip to be well below the maximum. The maximum number of transistors is a function of
the process resolution (characterized by the length unit λ [58](p.48)) and of the physical dimensions of
a chip that is manufacturable with acceptable yield. Commercial factors tend to cause λ to decrease
and the physical size to increase over time [23]. The constituents of the PE chip of a SIMD computer
compete for the limited resources available within the chip. As suggested in Figures 3.2 and 3.4,
these resources are shared among the FU, context manager, and registers of each PE, MCS local
control and interfaces, and the local controller.

Chips have limited numbers of external connections. If the input and output pads to which
inter-chip wires are attached are arrayed around the periphery of a chip of edge length N, then there
are $O(N)$ such pads. This number grows more slowly with N than does the maximum number of
transistors per chip, which is $O\left((N/\lambda)^2\right)$. As VLSI implementation technique improves, N grows while λ
decreases. Therefore, the ratio of transistors to pads arranged around the periphery grows as $O(N/\lambda^2)$.
The placement of I/O structures around the periphery is not necessary for VLSI implementation of
the PE chip; however, it is characteristic of inexpensive present-day chips.

A  large  proportion  of  the  chips  in  a  SIMD  computer  are  PE  chips.  As  indicated  in  Figure  3.2,  each 
PE  chip  is  accompanied  by  a  set  of  memory  chips.  Other  chips  are  sometimes  required  for  inter-PE 
communication,  for  system  data  memory  access,  for  problem-specific  I/O,  or  for  transmitting  status 
information to the system controller. The continual use of a global instruction broadcast subsystem
distinguishes a SIMD computer from other multiprocessors. The global instruction broadcast
network fans out from the system controller to all PE chips in the system. The total capacitive load
presented by the PE chips' global instruction receiver pins is large, and the worst-case distance of
these pins from the driver is likely to be as large as that for any signal driven in the SIMD computer.

    program tree_sum_loop(logP, &A_pi):
    ! The tree-summation loop. Parameters:
    !   logP is log(P), P the number of array elements and PEs
    !   &A_pi is the address in local external memory of A[π]

    IJ>X ntO '0'               ; R1 = LITERAL('&A_pi')
                               ; R2 = LOAD(R1)             ! Initialize sum
                               ; R3 = PASS('-1')           ! Initialize targmask
    CJSR FORC LOOP 'logP - 1'  ;                           ! logP iterations at label LOOP
                               ; [CIA]                     ! Awaken all the PEs
    HALT                       ;
    LOOP:
                               ; R3 = LSHIFT(R3,'1')
                               ; R4 = AND(R0,R3)
                               ; LC-PUSH_EQ(R4,R0)         ! Sleep unless targ == π
                               ; R4 = PASS('-1')           ! targ = -1 => send to no one
                               ; [POP] R1 = ROUTE(R2,R4)   ! perform inter-PE communication
                               ; LC-PUSH_NE(R4,'-1')       ! Remain awake only if a parent this step
                               ; R2 = ADD(R1,R2)
    LTST ICTO LOOP             ;                           ! iterate until loop counter reaches 0

Figure 3.8: Assembly language for the tree-summation loop. Each line specifies one assembly
language instruction. To the left of the semicolon is a system controller instruction; to the right is a PE
instruction. An exclamation point begins a comment that continues to the end of the line. The PE instructions
of the form LC-PUSH_* create a new context in which the PE is conditionally asleep, depending on the
value of the specified condition in the PE. The part of the PE instruction in brackets also manipulates
local context, reverting to the previous context (in the case of [POP]) or unconditionally waking up
the PE (in the case of [CIA]). The system controller instruction CJSR begins executing the loop body,
while the system controller instruction LTST performs the loop completion test.

3.6  Operation  Stepcounts


Limited  chip  area  means  that  there  might  not  be  as  much  chip  area  available  as  desired  in  the 
PE  chip  for  FUs  and  register  files.  Also,  limited  pins  means  that  there  may  not  be  as  many  pins 
available  as  desired  per  MCS.  A  common  compromise  introduced  by  such  resource  constraints  is 
to time-multiplex the available resources. FU or MCS operations that are time-multiplexed take
multiple  clock  cycles  to  complete. 

One  way  that  FU  time-multiplexing  arises  is  when  the  FU  has  a  width  (in  bits)  that  is  less 
than  the  width  of  the  operands.  For  example,  the  MP-1  PE’s  4-bit  FU  adds  32-bit  integers  in  8 
steps [62]. Another way that FU time-multiplexing arises is when the circuit complexity of the FU
is  less  than  that  required  for  a  given  operation.  For  example,  there  may  not  be  sufficient  chip  area 
for  a  combinational  multiplier,  and  multiplication  may  be  carried  out  as  a  sequence  of  additions. 
The  SLAP  PE  provides  an  example,  wherein  its  16-bit  FU  (with  a  built-in  Booth’s  bit-pair  recoder) 
multiplies  16-bit  integers  in  8  steps  [27]. 
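
The two step counts quoted above follow from simple arithmetic, sketched here in Python. The formulas are assumptions chosen because they reproduce the quoted figures (a narrow FU handling one FU-width slice of the operand per step, and a bit-pair recoded multiplier retiring two multiplier bits per step); real designs may add overhead cycles. The function names are hypothetical.

```python
def bit_sliced_steps(operand_bits, fu_bits):
    """Steps needed when an FU of width fu_bits processes an operand of
    width operand_bits one slice at a time (ceiling division)."""
    return (operand_bits + fu_bits - 1) // fu_bits


def bit_pair_multiply_steps(multiplier_bits):
    """Steps needed when each step retires two multiplier bits, as with
    Booth bit-pair recoding."""
    return (multiplier_bits + 1) // 2
```

Under these assumptions, `bit_sliced_steps(32, 4)` gives the 8 steps cited for 32-bit addition on a 4-bit FU, and `bit_pair_multiply_steps(16)` gives the 8 steps cited for 16-bit multiplication with bit-pair recoding.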

Depending  on  the  degree  of  time-sharing  of  PE  chip  pins  and  on  network  design,  an  MCS  may 
use  multi-step  sequences  to  perform  its  data  transfers.  For  example,  local  external  memory  access 
through a shared port requires a number of steps proportional to the number of PEs in the chip. As
another  example,  the  SLAP  I/O  subsystem  delivers  a  new  datum  to  each  PE  in  a  number  of  steps 
proportional  to  the  total  number  of  PEs  [25]. 

The need for multiple steps to carry out an operation is the principal difference between assembly
language instructions and machine code instructions. An assembly language instruction specifies
an FU or MCS operation. Assembly language semantics are sequential: each instruction's
operation has completed before the next instruction begins executing. Machine code semantics, by
contrast, allow that multiple machine clock cycles are sometimes needed to perform a given operation.

To make it possible to represent SIMD computers that vary in these ways, a stepcount parameter
is associated with every assembly language FU and MCS operation. FU operation stepcounts
characterize FU bit-width and circuit complexity relative to the requirements of application data, while
MCS operation stepcounts characterize PE chip pin sharing and inter-chip network complexity. The
stepcount of an operation gives the number of machine clock cycles in the underlying SIMD computer
required to perform it. As a simplification, the model prohibits pipelined operations.

For some operations, the actual number of steps depends on the specific data involved. This
dependence occurs, for example, in logical word rotation using a distance-1 barrel shifter. This
dependence also occurs in the general case of routed inter-PE communication. Assigning fixed
numbers of clock cycles to operations is sometimes an imperfect approximation that compromises the
validity of results. The sensitivities of results to these approximations are identified and compensated
where appropriate in the analysis of measured results.

As an example of how stepcounts characterize a SIMD computer, the following is a list of stepcounts
for some operations of a computer operating on 32-bit integers with 16-bit PEs packed 4 to
a PE chip (as in SLAP [26]) and inter-connected through a three-stage permutation network (as in
GF11 [6]):

Figure 3.9: Machine code for the tree-summation loop. Here, P=1024, the array element A[π]
happens to be at address 17 in PE memory (so &A_pi = 17), operands are 32 bits wide, PE FUs are 16
bits wide, there are 4 PEs per chip, and inter-PE communication uses a 3-stage permutation network.


    Operation Name   Meaning                                                   Stepcount
    LITERAL          Broadcast literal                                         2
    AND              Logical and                                               2
    ADD              Addition                                                  2
    LSHIFT           Left shift                                                2
    LC-PUSH_EQ       Push new context; PE is active if operands are equal      2
    LC-PUSH_NE       Push new context; PE is active if operands aren't equal   2
    PASS             Logical identity                                          2
    LOAD             Local external memory load                                4
    ROUTE            Routed inter-PE communication                             3

Using the stepcounts in the above table and assuming that P=1024 and that array element A[π]
lies at address 17 in PE memory, the assembly language program in Figure 3.8 results in the machine
code program shown in Figure 3.9. (Machine code is detailed in Appendix A.)
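
To give a feel for how stepcounts translate into cycles, the following Python sketch totals the stepcounts of the PE operations in the tree-summation assembly program. It is a rough illustration, not a reconstruction of the machine code in Figure 3.9: it assumes strictly sequential execution with no overlap of independent operations, and it ignores system controller instructions and the bracketed [POP] and [CIA] context operations, for which no stepcount is listed. The names `STEPCOUNT`, `PROLOGUE`, `LOOP_BODY`, and `sequential_cycles` are hypothetical.

```python
# Stepcounts for the example machine (32-bit operands, 16-bit FUs,
# 4 PEs per chip, 3-stage permutation network), as tabulated in the text.
STEPCOUNT = {
    "LITERAL": 2, "AND": 2, "ADD": 2, "LSHIFT": 2,
    "LC-PUSH_EQ": 2, "LC-PUSH_NE": 2, "PASS": 2,
    "LOAD": 4, "ROUTE": 3,
}

# PE operations of the assembly program, in program order.
PROLOGUE = ["LITERAL", "LOAD", "PASS"]
LOOP_BODY = ["LSHIFT", "AND", "LC-PUSH_EQ", "PASS",
             "ROUTE", "LC-PUSH_NE", "ADD"]


def sequential_cycles(log_p):
    """Clock cycles if every PE operation runs to completion before the
    next one starts (an assumed upper bound on PE operation time)."""
    prologue = sum(STEPCOUNT[op] for op in PROLOGUE)
    body = sum(STEPCOUNT[op] for op in LOOP_BODY)
    return prologue + log_p * body
```

With P=1024 (so log2 P = 10), the prologue contributes 8 cycles and each of the 10 iterations contributes 15, for 158 cycles under these assumptions. Because machine code may perform mutually independent operations concurrently, the actual machine code program may need fewer cycles.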
3.7  PE  Chip  Model

For scalable data-parallel problems, throughput is proportional to the number of PEs. To maximize
the  throughput  for  scalable  data-parallel  problems,  SIMD  computer  architecture  aims  to  maximize 
the  number  of  PEs  realized  in  a  given  total  chip  area.  To  that  end,  the  PE  of  a  SIMD  computer  is 
specialized,  consisting  mostly  of  FU  and  registers  (as  sketched  in  Figure  3.4).  This  specialization 
allows  most  of  the  chip  area  occupied  by  the  PE  to  contain  the  components  needed  to  perform 
calculations. PE chip payload is a chip-area measure that expresses the "amount of PE" contained
in a PE chip. This section presents a generic model of a PE chip, develops a formula for payload,
and applies the formula to a number of examples. Payload is a useful way to compare PE chips
made  using  different  VLSI  implementation  techniques.  As  shown  in  Chapter  4,  payload  provides  a 
uniform  quantitative  basis  for  estimating  the  cost  of  I-cache. 

It is difficult to find a uniform payload metric, because there are many alternative implementations
of a PE chip containing PEs of a given architecture. VLSI implementation technique encompasses
alternatives in logic designs, in circuit designs, in geometric layout, and in chip fabrication
process.  Diverse  logic  structures  for  MOS  chips  are  presented  systematically  in  [83].  As  an  example 
of  the  range  of  circuit  design  techniques,  circuits  with  active  storage  elements  (flip-flops)  are  usually 
easier  to  design  than  their  counterparts  with  passive  storage  elements  (capacitances),  although  the 
former  tend  to  be  larger.  As  an  example  of  the  range  of  layout  techniques,  automated  layout  of  circuits 
is  typically  easier  than  the  manual  alternative,  although  the  latter  method  tends  to  yield  smaller 
and faster circuits. VLSI fabrication processes vary in geometric resolution, in design rules, in the
physical  sizes  of  switching  devices,  and  in  the  electrical  characteristics  of  the  switching  devices. 

The  diversity  of  VLSI  implementation  techniques  defies  uniform  payload  comparisons  among  PE 
chips.  However,  assuming  similar  design  techniques  have  been  used,  it  is  possible  to  compare  the 
payloads of chips based on their physical characteristics alone. The parameters H and W represent
the physical dimensions of the PE chip in mm. Typical values for H and W for microprocessors today
range from about 10 mm to more than 16 mm. The parameter λ represents the geometric resolution
of the VLSI fabrication process in μm [58](p.48). Typical values of λ for processes used to make
microprocessors today range from as low as 0.3 μm to as high as 1.0 μm.


[Figure 3.10 labels: W mm; site]

To approximate its area, a chip can be thought to contain a grid whose resolution is λ × λ, as
sketched in Figure 3.10. Each grid square is a site on which can be placed integrated circuit elements,
including switching devices or wires. A site is λ μm on a side and occupies chip area $\lambda^2 \times 10^{-6}$ mm².
The total number of sites in a computer system is sometimes called its "grain size" [70](p.1252). The
total number of sites on a chip is shown in Equation 3.1.

$$\frac{\#\ \text{sites}}{\text{chip}} = \frac{H\,\text{mm} \times W\,\text{mm}}{(\lambda\,\mu\text{m})^2} = \frac{HW}{\lambda^2} \times 10^6 \qquad (3.1)$$

The total number of sites on a chip is determined solely by VLSI fabrication process characteristics.
This number ranges over about an order of magnitude for current chips, from about $10^8$ to just beyond
$10^9$.

The site is a unit of chip area that scales with H, W, and λ. For a given number of sites per
switching device, the number of devices scales with the number-of-sites measure of variations in
VLSI fabrication process analogously to the scaling of circuit geometries with the linear resolution
parameter λ.

Although the "amount" of PE realized in the available chip area depends on the entire VLSI
implementation technique, PE chip payload is proportional to the number of sites available for PEs.
Π denotes payload, and Π is defined in Equation 3.2:

$$\Pi = \#\ \text{of PE chip sites used for PEs} \qquad (3.2)$$

The chip area occupied by a PE depends both on the PE architecture and on the VLSI implementation
technique. PE architecture encompasses such details as datapath width, register count, MCS
interface design, and FU circuit complexity¹.

Existing SIMD PE chips are typically organized as linear arrays of bit slices. Figure 3.11 illustrates
such a PE chip organization. On-chip linear arrays are used in the MP-1 PE chip (pictured
in [56]), in its precursor designed at DEC (pictured in [33]), and in the SLAP PE chip (pictured
in [27]). The PE area in the Blitzen PE chip is tiled with a two-dimensional array of ALU-memory
pairs, as shown in [36]. The Blitzen PE chip is equivalent to a linear array of bit slices, wherein each
bit slice consists of FU and registers alternating in the vertical dimension. The four examples mentioned
here are representative of the small number of SIMD PE chips for which microphotographs
of working chips have been published.²

Figure 3.11 suggests that I/O structures occupy a ring around the perimeter of the PE chip. This
placement of I/O structures is not necessary for VLSI implementation of the PE chip, although it
is characteristic of inexpensive chips fabricated using present-day VLSI implementation technique.
The sites within this perimeter ring are not available for PEs. Figure 3.11 reflects the assumption that
the thickness of the I/O ring is 500μm, as happens to be the case for each of the three PE chips
examined in detail in this Chapter. A 500μm ring on every side removes 1mm from each of H and W.
Given these assumptions, Equation 3.3 defines I, the interior chip area of the PE chip that is
available for PEs:

$$I \equiv \frac{\text{interior chip area}}{\text{PE chip}} = \frac{(H-1)(W-1)}{\lambda^2} \times 10^6 \ \text{sites} \qquad (3.3)$$


^1 The circuit complexity of the FU refers loosely to the amount of arithmetic work the FU performs in a single operation.
For example, a combinational multiplier array performs multiplication in a single clock cycle's operation, whereas an adder
with a built-in multiply step performs multiplication over a sequence of clock cycles. A combinational multiplier has greater
complexity than a simple adder, which typically implies that the multiplier would occupy greater chip area and operate at
a lower maximum rate.

^2 Floor plans for other SIMD PE chips, including those shown in [87] and in [68], suggest on-chip linear array organization.
CLIP7A, an early VLSI-based SIMD computer, contains just 1 PE within its relatively small PE chip [32].
Figure 3.11: Generic SIMD PE Chip Floor Plan

Ideally, the interior of the PE chip would be used entirely for payload. In practical SIMD computers,
some proportion of the interior chip area is occupied by the local controller and by MCS control
and pin-access circuits, as suggested in Figure 3.11. Λ denotes the interior chip area that is occupied
by the local controller and by MCSs, while Π denotes the remaining interior chip area. Λ and Π are
related to I as per Equation 3.4:

$$I \equiv \text{non-PE interior chip area} + \text{payload chip area} = \Lambda + \Pi \ \text{sites} \qquad (3.4)$$
Whereas I depends only on VLSI implementation technique, Λ depends also on local controller
design and on MCS interface requirements. Therefore, the payload Π depends severally on local
controller design, on MCS interface requirements, and on VLSI implementation technique.

Substituting for I from Equation 3.3 into Equation 3.4 yields the PE chip payload formula
shown in Equation 3.5:

$$\Pi = \#\ \text{of PE chip sites used for PEs} = \text{interior area} - \text{non-PE interior area} = \frac{(H-1)(W-1)}{\lambda^2} \times 10^6 - \Lambda \ \text{sites} \qquad (3.5)$$


Equation  3.5  gives  a  definition  for  PE  chip  payload  that  is  independent  of  PE  architecture. 
Equation  3.5  suggests  that  payload  depends  only  on  the  following  characteristics  of  the  PE  chip: 


•  VLSI  implementation  technique, 

•  MCS  interface  requirements, 

•  and  local  controller  design. 


Note  that  only  the  last  of  these  characteristics  changes  when  I-cache  is  added  to  the  PE  chip. 

Equation 3.5 gives the payload Π as a function of H, W, λ, and non-PE interior area Λ. It
is possible to calculate the payload of a PE chip given a microphotograph of the chip, if the VLSI
fabrication process parameters are known. Published microphotographs of PE chips reveal floorplans
that conform roughly to that sketched in Figure 3.11, with the exceptions that the MCS interfaces
occupy an annular ring just inside the I/O ring, and the local controller occupies a vertical strip
down the center of the chip instead of being to one side of the PE area as shown in Figure 3.11. The
dimensions of the local controller and the MCS interfaces can be measured in the microphotograph
using a ruler. Adding together the areas given by these measured dimensions and then dividing by the
total interior area yields an estimate of the fraction of the interior area used for MCS interfaces and
for the local controller. Multiplying this fraction by the interior area (I, defined in Equation 3.3)
yields non-PE interior area Λ. Finally, substituting this value for Λ and the known physical parameters
into Equation 3.5 yields a value for PE chip payload Π.
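The procedure just described can be sketched in a few lines of Python. This is only an illustration of Equations 3.3 and 3.5 under the 500μm I/O ring assumption; the function names are invented for the example, and the inputs are the SLAP dimensions and the 36% non-PE fraction quoted in this Chapter. Because the measured fractions are rough estimates, results for other chips may differ from the tabulated values by a unit or so of rounding.

```python
# Sketch of Equations 3.3 and 3.5. H and W are in mm, lam (lambda) is in
# micrometres, and all areas are in units of 10^6 sites.
# Function and constant names are hypothetical, chosen for this example.

IO_RING_MM = 0.5  # assumed I/O ring thickness: 500 um on every side


def interior_area(h_mm, w_mm, lam_um):
    """Equation 3.3: interior area I, in units of 10^6 sites."""
    ring = 2 * IO_RING_MM  # the ring removes 1 mm from each dimension
    return (h_mm - ring) * (w_mm - ring) / lam_um ** 2


def payload(h_mm, w_mm, lam_um, non_pe_fraction):
    """Equation 3.5: return (I, Lambda, Pi), in units of 10^6 sites."""
    i = interior_area(h_mm, w_mm, lam_um)
    non_pe = non_pe_fraction * i  # Lambda, from the measured area fraction
    return i, non_pe, i - non_pe


# SLAP: H = 9.2 mm, W = 7.9 mm, lambda = 1.0 um, non-PE fraction about 36%
slap_i, slap_lambda, slap_pi = payload(9.2, 7.9, 1.0, 0.36)
print(round(slap_i), round(slap_lambda), round(slap_pi))
```

Keeping the areas in units of 10^6 sites lets the results be compared directly with the physical-parameter summary for the four PE chips.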

Table 3.2 shows a summary of the physical parameters and payload for four PE chips: SLAP,
MP-1, Blitzen, and ALAPH. The first three of these chips have been described in the literature, while
the fourth is hypothetical, based on current VLSI process technology.

The SLAP PE chip contains 4 16-bit PEs [27], organized as one row of 64 bit slices. Inspection of
the SLAP chip (shown in the photo in [27]) indicates that the local controller and the MCS interfaces
together occupy about 36% of the interior area.
                                              SLAP    MP-1    Blitzen   ALAPH
H (mm)                                        9.2     9.5     11.0      13.9
W (mm)                                        7.9     11.6    11.7      16.8
λ (μm)                                        1.0     0.8     0.5       0.375
interior area I (×10^6 sites)                 57      141     428       1400
non-PE area fraction Λ/I (%)                  36      17      17        -
non-PE area Λ (×10^6 sites)                   20      23      72        100
payload Π (×10^6 sites)                       36      118     356       1300
PE FU width (bits)                            16      4       1         -
fraction of PE bit-slice
occupied by registers (%)                     20      34      60        -

Table 3.2: Physical Parameters and Payload Estimates for Four SIMD PE Chips


The MP-1 PE chip contains 32 4-bit PEs [62], organized as one row of 128 bit slices. Inspection
of the MP-1 chip (shown in the large color photo in [56]) indicates that the local controller and the
MCS interfaces together occupy about 17% of the interior area.

The Blitzen PE chip contains 128 1-bit PEs [36], organized as 8 rows of 16 PEs per row. Inspection
of the Blitzen chip (shown in the large photo in [37]) indicates that the local controller and the MCS
interfaces together occupy about 17% of the interior area.

The ALAPH PE chip is a hypothetical one whose parameters are taken from the VLSI process
used for the Alpha microprocessor [21]. The estimate for ALAPH's non-PE area Λ of 100 × 10^6 sites
is conservative given the values for the existing chips. ALAPH would contain two or more 32-bit
PEs, each with at least 2000 registers.

Table 3.2 shows the proportion of the PE bit slice occupied by registers in the three existing PE
chips. It is interesting to note that the proportion of chip area allocated to registers increases as the
FU bit-width decreases.

Another interesting feature of Table 3.2 is that the ratio Λ/I happens to be 17% in both the Blitzen
and MP-1 PE chips, whereas SLAP's non-PE area represents about twice that fraction. There are two
likely sources of this disparity:

1. Multi-chip subsystem interface complexity.

Local external memory access in Blitzen and in MP-1 has the characteristic that the system
controller provides 1 address through the global instruction broadcast network that is used
for all PEs’ addresses. By contrast, the SLAP PEs each supply their own addresses to local
external memory. For this reason, the local external memory interface within the SLAP PE
chip occupies more chip area than its counterparts in the other two PE chips.

2. Local controller complexity.

Whereas the Blitzen and MP-1 local controllers occupy negligible area in their respective PE
chips, the SLAP instruction decoder occupies almost 20% of the interior. This difference is due
to the different functions performed by the local controllers in these chips.

In all three of these PE chips, machine code instructions are executed in a pipelined manner.
Typically, there are three execution phases, consisting successively of:

(a) register operand fetch,

(b) FU calculation, and

(c) register result write.

Successive pipeline stages execute on successive PE clock cycles. A machine code instruction
requires three clock cycles to complete, and on any given clock cycle, three successive machine
code instructions are in execution. The instruction provided to the PEs within the PE chip on
each clock cycle controls a single clock cycle’s activity, and therefore has contributions from each
of the three instructions currently in execution.

In SLAP, the system controller broadcasts machine code instructions. The SLAP PE chip’s local
controller decomposes each arriving instruction into its partial contributions on each of three
successive PE clock cycles to the single-cycle control word provided to the PEs. In Blitzen and
in MP-1, by contrast, the system controller broadcasts single-cycle PE instructions. Instead of
broadcasting the machine code instructions themselves, the system controller assembles each
globally broadcast instruction from the relevant parts of the three machine code instructions
currently in execution.

To perform its pipeline control function, the SLAP local controller contains pipeline staging
registers and decode logic that are not present in the Blitzen and MP-1 local controllers. The
design choice in SLAP for the local controller to perform pipeline control decoding represents
a departure from strict SIMD computer architecture. As for other program-control functions,
this decoding is more efficiently performed once for all PEs in the system controller, rather than
redundantly within each PE chip. The relatively large instruction decoder in SLAP
highlights the large chip-area cost paid for non-trivial instruction decoding within the PE chip.
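The assembly of a single-cycle control word from three overlapping machine code instructions can be illustrated with a small sketch. The field names and the string encoding are purely hypothetical; no real instruction format of SLAP, Blitzen, or MP-1 is implied. The sketch only shows that, on a given clock cycle, the control word combines the fetch part of the newest instruction, the FU part of the previous one, and the write part of the one before that.

```python
# Hypothetical sketch of three-phase pipeline control. Each machine code
# instruction contributes one field to the control word on each of three
# successive clock cycles. Names and encodings are illustrative only.
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    fetch: str   # control for the register operand fetch phase
    fu: str      # control for the FU calculation phase
    write: str   # control for the register result write phase


ControlWord = Tuple[Optional[str], Optional[str], Optional[str]]


def control_words(program: List[Instr]) -> List[ControlWord]:
    """Return one (fetch, fu, write) control word per clock cycle."""
    n = len(program)

    def field(index: int, name: str) -> Optional[str]:
        # None marks a pipeline stage that is idle while filling or draining.
        return getattr(program[index], name) if 0 <= index < n else None

    # Two extra cycles let the last instruction finish its FU and write phases.
    return [
        (field(cycle, "fetch"), field(cycle - 1, "fu"), field(cycle - 2, "write"))
        for cycle in range(n + 2)
    ]


program = [Instr(f"fetch{i}", f"fu{i}", f"write{i}") for i in range(3)]
words = control_words(program)
for cycle, word in enumerate(words):
    print(cycle, word)
```

In a Blitzen- or MP-1-style organization this assembly would happen once, in the system controller; in a SLAP-style organization the equivalent staging is repeated in every PE chip's local controller.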


3.8 How Large is ρ_b?

ρ_b expresses the ratio of maximum PE clock rate to global instruction broadcast rate. The speedup
due to I-cache depends on the value of ρ_b, which is in turn determined by characteristics of chips
and their interconnections. There is a range of possible values of ρ_b. In some existing SIMD
computers, ρ_b appears to be greater than 10. Detailed estimates suggest that ρ_b would be at least 2 in
practical SIMD computers made with foreseeable implementation technique. High-rate instruction
broadcast incurs significant PE chip pin and board wiring costs. Minimizing these costs leads to
lower instruction broadcast rates, so that a new SIMD computer whose resources are concentrated
in the PE chips instead of in the instruction broadcast network might exhibit ρ_b even higher than
observed in existing SIMD computers.

PE clock rate depends solely on the characteristics of the PE chip, whereas global instruction
broadcast rate depends on the size of the computer and on the electrical characteristics of the
broadcast network. For a given PE architecture, the PE clock rate is determined within a fairly narrow
range by the VLSI implementation technique. By contrast, the sizes of existing and foreseeable SIMD
computers vary widely. Variations in computer size and in system-level implementation techniques
give corresponding variations in the broadcast rate. For example, a very small SIMD computer might
fit within a single chip, wherein instructions are delivered to the PEs at the high on-chip rate. At
the other size extreme, a very large SIMD computer would fill a large chassis, wherein the rate of
instruction broadcast might be considerably lower than the PE clock rate. While it is possible in
principle to construct a network that broadcasts instructions at any reasonable PE clock rate for
a SIMD computer of any size, so doing requires precise matching of wire lengths and of electrical
component characteristics. For some SIMD computers, such precision is prohibitively expensive.
Maximum PE clock rate is determined by VLSI implementation technique and by PE architecture.
Current single-chip systems, including microprocessors, provide a basis for estimating the PE clock
rate achievable with readily foreseeable VLSI implementation technique. For example, a 32-b PE
made using recent CMOS VLSI implementation technique should operate at rates well beyond
the rate of 200MHz achieved for a 64-b microprocessor [21], while a 32-b PE made using recent
BiCMOS VLSI implementation technique should at least equal the 300MHz achieved for a 32-b
microprocessor [47]. Simplicity of function typically shortens critical paths and allows higher clock
rates. For example, operation rates of one-bit arithmetic components have been measured in excess
of 250MHz using 3μm CMOS [89], while a 333MHz 32-b adder has been implemented in 0.5μm
BiCMOS [35]. These examples suggest that the clock rate for a new PE would be between 200MHz
and 300MHz, or possibly even higher if the function unit operates on narrow words.

Having based PE clock rates on those of microprocessors, it makes sense to extrapolate from the
disparity between on-chip and off-chip clock rates observed for modern microprocessors to estimate
ρ_b directly. Most modern microprocessors operate internally at least two times faster than their
external memory interfaces [21, 88, 5]. A primary reason for this disparity between on-chip and
inter-chip operation rates is that off-chip wires are driven as lumped capacitances: The capacitances
of inter-chip wires are typically about an order of magnitude greater than the capacitances associated
with on-chip signals, and the time needed for a MOS circuit to drive a lumped capacitance grows
logarithmically with the capacitance [58](p.14).

The global instruction broadcast network in a SIMD computer is typically electrically larger and
far more geometrically complicated than a microprocessor's external memory interface. Were the PE
chips made using the microprocessors’ VLSI implementation technique and the global instruction
broadcast network driven as a lumped capacitance, then ρ_b would certainly be much larger than 2.
For example, in a SIMD computer occupying a single 50cm x 50cm printed circuit board (PCB), each
bit of broadcast instruction might present a lumped capacitance of around 2.5nF. Even using a very
low-resistance driver of 6Ω, the instruction bit rise time would be 33ns. Allowing 3 times the
rise time for the bit interval, this corresponds to a maximum broadcast rate of 10MHz. Against a PE
clock rate of 300MHz, this yields a ρ_b of 30.
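The arithmetic behind this estimate can be reproduced with a short sketch. It assumes that the quoted rise time is the 10%-to-90% rise time of a first-order RC circuit, t_r ≈ 2.2RC; that formula is not stated in the text, but it is consistent with the quoted figures of 6Ω, 2.5nF, and 33ns. The function name and parameter names are illustrative.

```python
# Sketch of the lumped-capacitance estimate of rho_b.
# Assumption (not stated in the text): rise time = 2.2 * R * C, the 10%-90%
# rise time of a first-order RC circuit.

def lumped_broadcast(r_ohm, c_farad, pe_clock_hz, rise_factor=2.2, intervals=3):
    """Return (rise time in s, broadcast rate in Hz, rho_b)."""
    t_rise = rise_factor * r_ohm * c_farad   # driver rise time into the load
    bit_interval = intervals * t_rise        # allow 3 rise times per bit
    rate = 1.0 / bit_interval                # maximum broadcast rate
    return t_rise, rate, pe_clock_hz / rate


t_rise, rate, rho_b = lumped_broadcast(6.0, 2.5e-9, 300e6)
print(f"rise time = {t_rise * 1e9:.0f} ns")
print(f"broadcast rate = {rate / 1e6:.1f} MHz")
print(f"rho_b = {rho_b:.1f}")
```

Under these assumptions the rise time comes to 33ns and the broadcast rate to about 10MHz, so ρ_b is close to 30 against a 300MHz PE clock.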

In a generic SIMD computer, the PE clock rate equals the system clock rate, which is determined
by the global instruction broadcast rate. One might expect the electrical design of a SIMD computer’s
global instruction broadcast network to do better than driving the instruction bits as lumped
capacitances, so as to provide instructions to the PEs at their highest execution rate. It is therefore
surprising that there is in fact a significant disparity between maximum possible PE operation
rates and system clock rates in existing SIMD computers. The global instruction broadcast rates of
recent SIMD computers range from as high as 20MHz to as low as 8MHz, as indicated in Table 3.3.
The clock rates indicated in Table 3.3 are significantly lower than the operation rates of circuits of
similar logical complexity to the PE function units implemented using similar VLSI implementation
techniques. A comparison of the operation rates for the chips listed in Table 3.4 against those of the
computers listed in Table 3.3 suggests that PEs in existing SIMD computers execute instructions at
rates about 10 times lower than they otherwise could. In other words, ρ_b appears to be around 10 in
existing SIMD computers.

High values of ρ_b for existing SIMD computers suggest that board- and chassis-level engineering
factors prevent high-rate global instruction broadcast. An alternative to driving each instruction bit
as a single lumped capacitance is to organize the global instruction broadcast network as a clocked
fanout tree whose nodes drive smaller lumped capacitances. In such a network, a high-rate clock
regulates the advance of instructions from the system controller through a tree of clocked registers
to the PE chips. The instruction fans out by a modest factor at each successive level of the tree.
Assuming that the system clock is distributed with sufficiently low skew, instructions progress at the
rate of inter-chip signaling. Of course, inter-chip communication in this network incurs module and
board crossings, and the need for all levels in the distribution tree to be synchronized likely means
that ρ_b > 2. The fanout tree achieves fast instruction broadcast at the cost of high latency, which price
is paid as PE idle time at each global data-dependent branch in a program. The greatest drawback
of a clocked fanout tree is that it contains a large number of chips and wires, which resources could
otherwise be used for PE chips themselves.

System            Clock rate   Technology    Year   Single-cycle PE Op        PE FU Width
GF11 [6]          20MHz        mixed         1985   floating-point multiply   32b
CM-2 [16]         8MHz         2.0μm CMOS    1987   add                       1b
SLAP [28]         8MHz         2.0μm CMOS    1988   2-b multiply step         16b
Blitzen [36]      20MHz        1.0μm CMOS    1988   add                       1b
MP-1 [62]         14MHz        1.6μm CMOS    1990   add                       4b
CM-200 [16, 17]   10MHz        1.5μm CMOS    1991   add                       1b

Table 3.3: SIMD Computer Speeds

Chip              Clock rate   Technology       Year   Single-cycle Op      Operand Width
Divider [86]      100MHz       2.0μm CMOS       1987   1-b divide step      54b
Yuan [89]         250MHz       3.0μm CMOS       1989   count                4b
DataWave [67]     125MHz       0.8μm CMOS       1990   2-b multiply step    12b
Yuan [90]         700MHz       2.0μm CMOS       1991   count                8b
Alpha [21]        200MHz       0.375μm CMOS     1992   4-b multiply step    64b
Hitachi [63]      250MHz       0.3μm BiCMOS     1992   add                  32b

Table 3.4: Chip Speeds

PCB traces typically have low resistance and may be modeled as distributed inductors and
capacitors, such that when run over a ground plane they may be used as transmission lines. The
design of a high-speed broadcast network that contains fewer chips than a clocked fanout tree would
exploit the electrical properties of transmission lines. The time for a voltage step to propagate
through a terminated transmission line is given as the time of flight delay, t_f:

$$t_f = \frac{l}{v} \qquad (3.6)$$

where l is the length of the line and v is the propagation velocity. v is typically on the order of the
speed of light, so transmission lines minimize signaling delays between chips.

More important with respect to high-rate instruction broadcast is the fact that a unit voltage
step does not deteriorate as it propagates through a uniform lossless transmission line. The rate
at which signals are delivered through a transmission line is determined not by the propagation
delay for a single voltage step, but rather by the time interval needed for a receiver to be able to
distinguish between successive voltage steps. Where the time of flight (t_f) through a transmission
line is significantly greater than the driver’s rise time (t_r), a transmission line makes high-rate
signaling possible by allowing multiple voltage steps to be in transit through the line at any given
time.

The global instruction broadcast network is driven at one point (at the system controller) and
is received by every PE chip. Unlike typical one-to-one inter-chip signal paths or many-to-many
busses (where many is small), the broadcast network is designed for one-to-many communication
where many is large. Among well-known electrical engineering problems, the design of the broadcast
network is closest to the design of a clock network.

The SIMD computer's PEs operate in lock-step, often exchanging information. To minimize
the time required for inter-PE communication, and to avoid the myriad design issues related to
synchronizing independent signals, assume that the clocks of the PEs are kept in-phase. One
way to keep the PEs in-phase is to distribute the system clock with minimum skew. Low-skew
clock distribution has been widely studied. Ordinarily, low-skew clock distribution demands careful
matching of signal path lengths in the distribution network [1](Chap.8). Some clock distribution
techniques exploit the inherent regularity of a clock signal. One such technique is to distribute the
clock as a standing wave, the phase of which is constant within regions bounded in diameter by half
the wavelength of the standing wave [13].

The physical device characteristic mismatches that give rise to skew in a clock distribution
network also cause skew through the global instruction broadcast network. The time allowed for
each broadcast instruction bit must allow for variations in the time for the bit to arrive at each of the
PE chips. Although this skew problem resembles that arising for the system clock, it is more difficult
to solve for two reasons: First, a clock signal is simply repetitive, whereas the broadcast instruction
is not. Second, a clock is typically only one signal, whereas an instruction contains many bits. This
multiplicity exacerbates the skew problem for broadcast instructions by increasing the number of
signals whose arrival times need to be matched.

Figure 3.12 illustrates the design of a broadcast network using transmission lines. If the
transmission lines in Figure 3.12 are ideal, properly terminated, and of equal lengths, then the rate of
instruction broadcast is determined by the rise time t_r of the driver. (A driver is indicated at point B
in Figure 3.12.) The time allowed per bit might be 3 times t_r, to ensure meeting the set-up and
hold constraints on the PE chip latch that receives the instruction bit. (A latch is indicated at point
D in Figure 3.12.) Using a high-speed ECL driver, such as the MC100E111 [59] with t_r=400ps, the bit
interval would be 1.2ns, for a broadcast rate of 833MHz.
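As a quick check of this figure, the following sketch computes the broadcast rate of an ideal, equal-length transmission-line network from the driver rise time alone. The function name is illustrative, and the factor of 3 rise times per bit is the allowance suggested in the text; skew is ignored here because the lines are assumed ideal.

```python
# Sketch: broadcast rate of an ideal transmission-line network, limited only
# by the driver rise time. Skew is ignored (ideal, equal-length lines).

def ideal_broadcast_rate_hz(rise_time_s, rise_times_per_bit=3):
    """Return the broadcast rate when each bit is allowed 3 rise times."""
    bit_interval = rise_times_per_bit * rise_time_s
    return 1.0 / bit_interval


rate_hz = ideal_broadcast_rate_hz(400e-12)  # t_r = 400 ps
print(f"{rate_hz / 1e6:.0f} MHz")
```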

The network sketched in Figure 3.12 would therefore provide instructions to the PEs as fast as the
PEs can execute them. However, that network is not practical, because it contains a transmission
line carrying each bit of the instruction directly from the system controller to each PE chip. In
a SIMD computer with a 32-b instruction that is delivered to each of 100 PE chips, the network
sketched in Figure 3.12 requires driving thousands of transmission lines from the system controller.
The number of driver chips needed, and the density of wiring near the system controller, make such
a large number of direct lines prohibitive.

Figure 3.12: One way to implement a very fast global instruction broadcast network uses a set of
transmission lines leading directly from the system controller to each PE chip.

A more realistic scenario is constructed by considering a specific hypothetical SIMD computer
with the following characteristics:

• The broadcast instruction is 32 bits wide.

• The computer contains 4800 PEs.

• The PEs are packed 4 to a chip, so there are 1200 PE chips.

• Each PE chip is mounted along with local external memory on a 3cm x 6cm multi-chip module
(MCM).

• The MCM and its board-level wiring take up a 4cm x 8cm region on a PCB containing PEs (the
PE board), so the PE board contains 6 rows with 10 PE chips per row.

• The computer fits in a single rack of 50cm x 50cm PCBs.

• 20 PE boards make up the entire computer, along with a PCB containing the system controller.

• The 21 boards fit in a rack that is 65cm wide, with the system controller occupying the middle
slot.

Figure 3.13 shows the rack layout. 32 bits driven to each of 20 PE boards means that 640
transmission lines are driven from the system controller board. This is still a very large number of
wires to route from the system controller, but it may not be prohibitively large.
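The bookkeeping for this hypothetical computer is easy to verify with a short sketch. All quantities come from the list above; the variable names are illustrative, and the orientation of the 4cm x 8cm region on the board (4cm along a row, 8cm between rows) is an assumption made to reproduce the stated 6 rows of 10 chips.

```python
# Sketch: sizing of the hypothetical 4800-PE SIMD computer.
# Assumption: each 4 cm x 8 cm region is placed with its 4 cm side along a row.

INSTRUCTION_BITS = 32
NUM_PES = 4800
PES_PER_CHIP = 4
ROWS_PER_BOARD = 6
CHIPS_PER_ROW = 10
REGION_W_CM, REGION_H_CM = 4, 8
BOARD_CM = 50

pe_chips = NUM_PES // PES_PER_CHIP                 # PE chips in the computer
chips_per_board = ROWS_PER_BOARD * CHIPS_PER_ROW   # PE chips on one PE board
pe_boards = pe_chips // chips_per_board            # PE boards in the rack
total_boards = pe_boards + 1                       # plus the system controller

# Transmission lines leaving the system controller board.
lines_per_chip_scheme = INSTRUCTION_BITS * pe_chips    # one line per bit per chip
lines_per_board_scheme = INSTRUCTION_BITS * pe_boards  # one line per bit per board

# Check that the chip regions fit on a 50 cm x 50 cm board.
row_length_cm = CHIPS_PER_ROW * REGION_W_CM
column_height_cm = ROWS_PER_BOARD * REGION_H_CM
fits = row_length_cm <= BOARD_CM and column_height_cm <= BOARD_CM

print(pe_chips, chips_per_board, pe_boards, total_boards)
print(lines_per_chip_scheme, lines_per_board_scheme, fits)
```

Driving one line per bit to each board rather than to each chip reduces the number of lines leaving the system controller by the number of chips per board, here a factor of 60.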

Figure 3.14 shows the layout of the PE board, and Figure 3.15 shows how the broadcast instruction
might be delivered along one row of the PE board.

A  broadcast  network  is  shown  in  Figure  3.16,  wherein  each  instruction  bit  is  driven  only  once  to 
each  PE  board. 

The  system  controller  generates  instructions  which  are  fanned  out  and  buffered  on  the  system 
controller  board.  The  instructions  are  delivered  to  the  PE  boards  via  transmission  lines.  The  column 
of  fanout  buffers  down  the  middle  of  the  PE  board  receives  the  broadcast  instruction,  fans  it  out,  and 
drives  12  tapped  transmission  lines  half  the  length  of  the  board,  2  such  lines  for  each  of  the  6  rows 
shown  in  Figure  3.14. 

As shown in Figure 3.15, a buffer on the PE board drives a transmission line “bus” that is tapped
by a number of PE chips. This arrangement conserves the number of driver chips in the network, and
it further reduces the overall wiring complexity as compared to that of the network in Figure 3.12.

Allowing 3t_r per bit for set-up and hold times at the instruction latches in the PE chips, the
minimum interval of the broadcast instruction is given as

$$\text{minimum bit interval} = 3t_r + \text{worst-case total skew} \qquad (3.7)$$

The contributions to skew are as follows:

$$\begin{aligned}
\text{skew components} = {} & \text{driver delay variations at system controller} \\
& + \text{coax cable length variations} \\
& + \text{driver delay variations at PE board} \\
& + \text{PCB trace length variations on PE board} \\
& + \text{bit arrival time variations on PE board} \\
& + \text{dispersion} \\
& + \text{rise time variations at PE chip receiver stubs}
\end{aligned} \qquad (3.8)$$

Figure 3.13: The layout of a rack containing a 4800-PE SIMD computer. The system controller is in
the center of the rack, from where it broadcasts instructions to the PEs.

Figure 3.14: Layout of the PCB containing the PEs. The PE board has a set of low-skew fanout
buffers in the center that drive the broadcast instructions to the PE chips on the board.

Figure 3.15: Detail of one row of the PE board. The instruction is driven through a tapped
transmission line along each row of the board. The PE chips along each row tap a shared line.

Figure 3.16: A more practical implementation of a fast global instruction broadcast network.
Instructions generated on the system controller’s board are distributed to the PE boards through coaxial
cables where they are buffered and driven through tapped transmission lines.

The following paragraphs estimate the skew components:

The best readily available driver is the MC100E111 low-skew fanout buffer [59]. Although the
rise time (t_r) of the MC100E111 is just 400ps, single-ended delays through the drivers on different
chips vary by up to 400ps.

There will be some variation in the lengths of the coax cables leading from the system controller
to the PE chips. The longest coax cable would need to be at least 82cm (half the system controller
board length (25cm) plus half the length of the rack (32cm) plus half the PE board length (25cm)).
Assuming that the coax runs vary in length by as much as 5%, or 4.1cm, the worst-case skew arises
from a path length difference of 8.2cm. Signals propagate at about 20cm/ns in coax, so a length
difference of 8cm corresponds to a skew of 440ps through the coax cables.

The low-skew fanout buffer on the PE board contributes another 400ps skew.

The instruction bit is received at one point on the PE board. From the point of reception, the
instruction bit is fanned out and buffered within the middle column of the PE board shown in
Figure 3.14. Ultimately, the instruction bit is driven horizontally from the center of the board
through the tapped transmission line shown in Figure 3.15. The lengths of the PE board traces that
route from the reception point through the buffer to the horizontal drive point vary by up to 25cm,
half the length of a 50cm PCB. The propagation velocity through a PCB trace is about 15cm per
ns [1](p.241). At that velocity, a 25cm variation in trace length yields a skew contribution of 1700ps.

The  use  of  busses  on  the  PE  board,  instead  of  point-to-point  drivers  supplying  each  PE  chip 
directly  with  its  own  copy  of  the  instruction,  conserves  driver  pins  as  well  as  wires  in  the  broadcast 
network.  An  unfortunate  consequence  of  this  economy  is  that  the  time  for  the  broadcast  bit  to  arrive 
at  every  PE  chip  tapping  one  such  bus  differs  by  as  much  as  the  time  of  flight  along  the  bus.  The  bus 
is  25cm  in  length.  The  PE  chip  receiver  pin  at  each  tap  adds  to  the  capacitance  of  the  transmission 
line. The capacitance of a PCB trace is typically about 1pF per cm. Adding 5pF per pin spaced at 4cm,
the PE chips increase that capacitance by a factor of 2.2. The propagation velocity v through a
transmission line is given as

$$ v = \frac{1}{\sqrt{LC}} $$

where L and C are the inductance and capacitance per unit length of the line. Multiplying the
capacitance by 2.2 decreases v by a factor of 1.5, from 15cm per ns to 10cm per ns. Substituting
l=25cm and v=10cm per ns into Equation 3.6 gives tf along the tapped transmission line of 2500ps.
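The loaded-line estimate above can be checked with a short calculation. The following is a minimal sketch, assuming that the taps leave the line inductance unchanged and that the pin capacitance can be treated as evenly distributed along the bus; the constant and function names are illustrative and do not come from the text.

```python
import math

# Values quoted in the text; the names are illustrative only.
PCB_CAP_PF_PER_CM = 1.0      # unloaded PCB trace capacitance
PIN_CAP_PF = 5.0             # capacitance added by each PE chip receiver pin
TAP_SPACING_CM = 4.0         # spacing between receiver taps
UNLOADED_V_CM_PER_NS = 15.0  # propagation velocity of an unloaded trace
BUS_LENGTH_CM = 25.0         # length of the tapped bus


def capacitance_factor(c_per_cm, pin_cap, spacing_cm):
    """Factor by which evenly spaced taps increase the line capacitance."""
    trace_cap = c_per_cm * spacing_cm
    return (trace_cap + pin_cap) / trace_cap


def loaded_velocity(v_unloaded, factor):
    """Since v = 1/sqrt(LC), scaling C by `factor` divides v by sqrt(factor)."""
    return v_unloaded / math.sqrt(factor)


factor = capacitance_factor(PCB_CAP_PF_PER_CM, PIN_CAP_PF, TAP_SPACING_CM)
v_loaded = loaded_velocity(UNLOADED_V_CM_PER_NS, factor)
flight_ps = BUS_LENGTH_CM / v_loaded * 1000.0

print(f"capacitance factor: {factor:.2f}")
print(f"loaded velocity:    {v_loaded:.1f} cm/ns")
print(f"time of flight:     {flight_ps:.0f} ps")
```

The unrounded capacitance factor is 2.25, which the text quotes as 2.2; its square root is the factor of 1.5 by which the velocity drops.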

Unfortunately, PCB traces do not make ideal transmission lines. The trace widths are not
perfectly uniform, dielectric thickness varies, and the traces have finite series resistance. These
deviations from ideality give rise to dispersion, whereby the propagation velocities of signals are
frequency dependent. A step input contains a range of spectral components, so the step input in fact
degrades as it propagates through a PCB trace. Dispersion measures the frequency dependence of
propagation delay through an imperfect transmission line. Dispersion tends to be greater at higher
frequencies. Because of its path dependence, dispersion is a skew component that is added to the
time per bit. Dispersion might contribute a skew of up to 50% of the signal rise time, or 200ps in this
case.

The signal is not terminated inside the PE chips, so a single-ended instruction bit brought onto
the PE chip represents a capacitive stub. (A PE chip pin is indicated at point C in Figure 3.16.) The
stub capacitances may vary by as much as 2.5pF due to variations in tap geometries for the various
bits of one instruction. Rise time of a lumped capacitance is 2.2RC, so a difference of up to 2.5pF
through 50Ω gives a time difference of up to 225ps.

Substituting the various skew estimates for the terms in Equation 3.8 yields the following total:

$$ \text{worst-case total skew} = 400\,\text{ps} + 440\,\text{ps} + 400\,\text{ps} + 1700\,\text{ps} + 2500\,\text{ps} + 200\,\text{ps} + 225\,\text{ps} = 5865\,\text{ps} \tag{3.9} $$

Substituting tr=400ps and the value from Equation 3.9 into Equation 3.7 yields the following
estimate for the instruction broadcast interval:

$$ \text{minimum bit interval} = 1200\,\text{ps} + 5865\,\text{ps} \approx 7.0\,\text{ns} \tag{3.10} $$

At  7.0ns  per  bit,  instructions  are  broadcast  at  about  142MHz.  For  a  PE  clock  rate  near  300MHz, 
this  gives  a  pb  of  about  2. 
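The skew budget and the resulting estimate of pb can be tallied in a few lines. This is only a bookkeeping sketch: the component values are the ones quoted in the text, the dictionary labels paraphrase the terms of Equation 3.8, and the variable names are assumptions made for illustration.

```python
# Skew components in picoseconds, as quoted in the text.
# The labels are paraphrases, not the author's exact terms.
SKEW_PS = {
    "driver delay variation at the system controller": 400,
    "coax cable length variation": 440,
    "fanout buffer on the PE board": 400,
    "PE board trace length variation": 1700,
    "time of flight along the tapped bus": 2500,
    "dispersion": 200,
    "receiver stub rise-time variation": 225,
}

RISE_TERM_PS = 1200   # rise-time term added to the skew in Equation 3.10
PE_CLOCK_MHZ = 300.0  # PE clock rate assumed in the text

total_skew_ps = sum(SKEW_PS.values())
bit_interval_ps = RISE_TERM_PS + total_skew_ps
broadcast_mhz = 1e6 / bit_interval_ps   # 1e6 ps per microsecond gives MHz
p_b = PE_CLOCK_MHZ / broadcast_mhz

print(f"total skew:     {total_skew_ps} ps")
print(f"bit interval:   {bit_interval_ps} ps")
print(f"broadcast rate: {broadcast_mhz:.1f} MHz")
print(f"p_b:            {p_b:.2f}")
```

Laying the budget out this way makes it plain that the two board-level terms (trace length variation and bus time of flight) account for most of the total.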

The  broadcast  rate  estimates  given  above  neglect  reflections  in  the  broadcast  network.  Parasitic 
impedances, such as are formed at corner turns in PCB traces or by imperfect connectors, cause
reflections  that  erode  signal  quality  and  lower  the  maximum  distribution  rate. 

Beyond  the  physical  limitations  discussed  above,  there  are  other  practical  considerations  that 
tend  to  keep  the  global  broadcast  rate  low. 

One  consideration  is  the  chip  area  occupied  by  the  broadcast  network.  Economy  of  chip  area  is  the 
principal  engineering  motivation  for  SIMD  computer  architecture.  Driver  chips  used  in  a  fast  global 
instruction broadcast network displace PE chips from the PE boards, thereby eroding the chip-area
economy  of  SIMD  computer  architecture. 

The speed of the resulting network is in any case limited by the rate at which instructions are
provided by the system controller (indicated at point A in Figures 3.12 and 3.16). The system
controller sequences instructions from memory, perhaps modifying loop-index-dependent literal fields
before driving the instruction to the broadcast network. Broadcasting instructions at a very high
rate requires the system controller to be able to supply them at that high rate. High-rate instruction
distribution requires a fast system controller whose cost may be high. SIMD computer architecture
leverages the optimized design of a PE chip through replication, but there is only one system
controller. A high-cost system controller may not be desirable. If an inexpensive system controller were
used, instructions might then be broadcast at a modest rate. For example, if the system controller
were a commercial microprocessor, instructions might be broadcast over the microprocessor's I/O bus.
Low-cost implementations of the system controller and global instruction broadcast network tend to
yield large values of pb.

Chips typically have limited numbers of signal pins. It is advantageous to allocate the fewest possible
PE chip pins to receiving instructions, so that the greatest number remain available for the other
inter-chip communication requirements of the PEs. One way to conserve PE chip pins is to time-share
the global instruction broadcast network receiver pins. Time-sharing the global instruction
broadcast receiver pins increases pb by a factor equal to the degree of time-sharing. For example,
bit-serial distribution of a 32-bit instruction word increases pb by a factor of 32, while freeing 31
PE chip pins for use in other multi-chip subsystems. Bit-serial instruction broadcast has the added
benefit of reducing the complexity of the skew problem for broadcast instructions.
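The pin-count trade-off described above is simple enough to express directly. The sketch below is illustrative only: the function name is hypothetical, and the starting value of pb = 2 for fully parallel broadcast is an assumption borrowed from the earlier estimate.

```python
def time_shared_broadcast(word_bits, receiver_pins, p_b_parallel):
    """Return (p_b, pins_freed) when a word is time-shared over fewer pins.

    The degree of time-sharing is the number of transfers needed per
    instruction word; p_b grows by that factor.
    """
    if receiver_pins <= 0 or word_bits % receiver_pins != 0:
        raise ValueError("receiver_pins must evenly divide word_bits")
    degree = word_bits // receiver_pins
    return p_b_parallel * degree, word_bits - receiver_pins


# Bit-serial distribution of a 32-bit instruction word,
# starting from an assumed p_b of 2 for fully parallel broadcast.
p_b_serial, pins_freed = time_shared_broadcast(32, 1, 2)
print(f"p_b = {p_b_serial}, pins freed = {pins_freed}")
```

Intermediate choices, such as 4 or 8 receiver pins, trade a smaller increase in pb against fewer freed pins.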

Another consideration is power consumption. Terminated transmission lines and high-current
bipolar drivers typically dissipate large amounts of power. In application contexts with small power
budgets, fast global instruction broadcast using transmission lines as illustrated above would not be
feasible.

It would be desirable to be able to upgrade an existing SIMD computer by making its PE chips faster
using a newer VLSI implementation technique. If the new PE chips have the same pinouts as the
original ones, and if power supplies and cooling are adequate, then such an upgrade would be
accomplished simply by replacing the PE chips. As the boards would be unchanged, the resulting increase in
pb is proportional to the increase in PE clock rate. Even if pb was about 1 in the original computer, it
would be higher in the upgraded one.

Finally,  the  SIMD  computer  architecture  might  be  scalable,  such  that  a  single  PE  chip  design 
is  intended  to  be  used  in  a  range  of  computers  whose  sizes  match  the  varying  size  requirements  of 
a  range  of  problems.  A  fast  global  instruction  broadcast  network  requires  solving  the  board-level 
electrical  design  problems  for  each  instance.  Simpler,  slower  global  instruction  broadcast  networks 
are  not  so  constrained  in  the  size  of  the  network  or  in  the  geometry  of  the  wires,  and  therefore  lend 
themselves  more  readily  to  scaling  of  the  computer. 

Note that the use of transmission lines to distribute instructions introduces a form of pipelining in
global instruction broadcast, because a number of instructions are in transit in the network at any one
time. When a program specifies a branch whose outcome depends on intermediate results calculated
in the PEs, the latency of the branch as measured in the number of instruction times is equal to
the number of instructions in the broadcast pipeline. After broadcasting a branch instruction, the
global instruction broadcast network fills up with (probably wasted) branch delay slot instructions.
The number of instructions in transit through the broadcast network is given by the ratio of the
total delay of the network to the bit interval. If the path length of an instruction bit is about 200cm,
then the time to propagate through the network at 15cm per ns would be about 13ns, or more than
8 instruction times at 600MHz.
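The depth of the broadcast pipeline follows directly from the ratio just described. The sketch below is a minimal illustration using the figures quoted in the text; the function name and its parameters are hypothetical.

```python
def broadcast_pipeline(path_cm, velocity_cm_per_ns, rate_mhz):
    """Return (network delay in ns, instructions in transit).

    The number of instructions in transit is the ratio of the total
    network delay to the interval between successive instructions.
    """
    delay_ns = path_cm / velocity_cm_per_ns
    interval_ns = 1000.0 / rate_mhz
    return delay_ns, delay_ns / interval_ns


# Figures from the text: 200cm path, 15cm per ns, 600MHz instruction rate.
delay_ns, in_transit = broadcast_pipeline(200.0, 15.0, 600.0)
print(f"network delay: {delay_ns:.1f} ns")
print(f"instructions in transit: {in_transit:.1f}")
```

Each instruction in transit corresponds to one branch delay slot when the branch outcome depends on results computed in the PEs.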

A  similar  global  branch  latency  arises  using  I-cache.  At  a  global  branch,  the  execution  of  cached 
instructions  effectively  ceases,  awaiting  the  arrival  of  the  first  instruction  following  the  branch.  If 
the  target  instruction  sequence  is  in  cache,  execution  resumes  at  the  high  PE  clock  rate.  Otherwise, 
the  needed  instructions  must  be  delivered  to  the  PE  chips  at  the  low  rate.  Fast  instruction  broadcast 
is superior in this regard, because the cache store overhead is never incurred, and only the branch
delay  slot  instructions  are  potentially  wasted. 

High-rate  global  instruction  broadcast  is  attainable,  although  the  highest  broadcast  rate  is  at  least 
a small factor slower than the highest PE clock rate. Furthermore, fast instruction broadcast network
components  compete  with  PE  chips  for  board  real  estate.  I-cache  is  an  architectural  enhancement 
which,  if  effective  and  practicable,  allows  lower-cost  system  controller  and  board-level  designs.  There 
is  a  broad  range  of  reasonable  estimates  for  pb.  High  estimates  for  the  PE  clock  rate  are  extrapolated 
from  existing  microprocessors,  whereas  high  estimates  for  global  instruction  broadcast  rate  are 
speculative and have yet to be demonstrated. Based on these estimates, it appears that pb is as low
as 2 and as high as 16. The mid-point of this range, pb=8, is a conservative estimate for pb in existing
SIMD computers.

3.9 Clock Intervals and p-Sets

Instruction broadcast rate is lower than PE clock rate because the global instruction broadcast
network wires are much longer and more heavily loaded than the typical capacitances driven inside
the PE chip. This observation applies not only to global instruction broadcast, but to every MCS,
because every MCS contains inter-chip wires and inter-chip wires typically present greater driving
loads than intra-chip wires. The top operation rate for each MCS is determined by the VLSI
implementation technique and by the geometries of inter-chip wires. Inter-chip wire geometries are
in turn determined by the MCS network topologies and by board-level wiring constraints. It is
reasonable to assume that the top operation rate for each MCS lies somewhere between the PE clock
rate and the instruction broadcast rate.

The  PEs  in  an  I-cached  SIMD  computer  are  clocked  at  their  highest  rates  irrespective  of  the  rate 
of global instruction broadcast. Similarly, the presence of I-cache means that instruction availability
no  longer  prevents  each  MCS  from  operating  at  its  top  rate.  Therefore,  the  model  that  is  the  basis 
for  evaluation  of  I-cache  should  allow  for  the  possibility  that  the  top  clock  rates  of  the  various  MCSs 
differ. 

Let tpE denote the interval of the fastest clock within the PE chip. tpE is determined by PE
architecture and by VLSI implementation technique. For example, tpE exceeds the time required to
drive the PE busses and the time required for the FU to produce a single-step result. PE chips in
SIMD computers tend to exhibit low circuit complexity, as for example do MP-1 [62], Blitzen [36],
and SLAP [27]. tpE tends to be low, significantly lower than the interval of the system clock that
regulates global instruction broadcast. Large wire delays between chips in an MCS mean that its
minimum clock interval is larger than tpE.

Figure  3.17  shows  a  sketch  of  the  simple  delay  model  used  to  relate  the  intervals  of  the  clocks 
regulating  the  various  subsystems.  The  parameters  of  the  model  sketched  in  Figure  3.17  have  the 
following  interpretations: 

•  tx represents the minimum propagation time through a component of MCS X resident on a chip
remote from the PE chip. For example, tl represents local external memory access time, while tc
represents packet forwarding time for router-based inter-PE communication. tx is determined
by functional complexity and by VLSI implementation technique.

•  Wx  is  the  delay  of  the  wires  connecting  the  PE  chip  to  a  remote  integrated  component  of  MCS 
X.  Wx  represents  the  delay  of  on-chip  wire  drivers  lumped  together  with  electrical  propagation 
delays  along  the  wires  themselves.  Wx  depends  on  wire  geometry,  on  capacitive  loading,  and 
on  the  on-resistance  of  the  transistors  driving  the  MCS’s  wires. 

•  px  represents  the  factor  by  which  the  minimum  interval  of  the  clock  regulating  MCS  X  is  greater 
than  tpE. 

In  the  simple  model  sketched  in  Figure  3.17,  wires  are  driven  as  lumped  capacitances,  with 
the notable exception of the global instruction broadcast network (discussed in Section 3.8). For
simplicity,  all  MCS  operation  rates  are  modeled  in  a  uniform  manner  here. 

Assuming that tx and tpE are roughly equal, and that signaling through inter-chip wires overlaps
with the operation of the remote circuits, Equation 3.11 gives px:

$$ p_X = \frac{W_X}{t_{PE}} \tag{3.11} $$

Note that if tpE is taken to be 1, then px represents the minimum interval of a clock regulating
MCS X. A lower bound for px is obtained by comparing a lower bound for Wx against an upper bound
for tpE.


Figure 3.17: Parameters Determining the Relative Speeds of MCSs

An MOS circuit can drive a capacitive load in minimum time using an exponential horn, a series
of inverters of geometrically increasing size, such that the delay is uniform (and minimum) at each
successive stage. The ideal step-up ratio has been estimated to be e [58](p.13), while other estimates
range as high as 10 [61](p.161). Where the step-up ratio is r and the (scale-independent) parasitic
delay of an inverter is p, then the delay of one stage in the horn is r + p. The ideal step-up ratio r is
determined by the VLSI process and is independent of the number of stages [77].

Assuming the driven signal originates from a minimum inverter, the number of stages is log_r(CL/Cg),
where Cg is the minimum inverter's gate capacitance. The minimum time tCL to drive a capacitive
load CL is then given by Equation 3.12:

$$ t_{C_L} = (r + p)\,\log_r \frac{C_L}{C_g} = (r + p)\,\frac{\ln (C_L / C_g)}{\ln r} \tag{3.12} $$


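Let Cx denote the capacitive load driven between chips in MCS X and Con denote the worst-case
intra-chip load capacitance determining tpE. Then px is given by Equation 3.13:

$$ p_X = \frac{(r + p)\,\ln (C_X / C_g) / \ln r}{(r + p)\,\ln (C_{on} / C_g) / \ln r} = \frac{\ln (C_X / C_g)}{\ln (C_{on} / C_g)} \tag{3.13} $$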


Typical capacitances for printed-circuit-board-based (or PCB-based) technology are as follows:

Cg ≈ 10fF
Con < 3pF
Coff > 15pF

Substituting these values into Equation 3.13 yields the rough lower bound in Equation 3.14:

$$ p_X > 1.3 \tag{3.14} $$
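The lower bound can be reproduced numerically. The sketch below assumes the exponential-horn model described above, with a per-stage delay of r + p and log_r(C/Cg) stages, and takes px to be the ratio of the off-chip drive time to the on-chip drive time; that reading of Equations 3.12 and 3.13 is an interpretation, and the function names are illustrative.

```python
import math


def horn_delay(c_load, c_gate, r, p):
    """Delay of an exponential horn: (r + p) per stage, log_r(c_load/c_gate) stages."""
    stages = math.log(c_load / c_gate) / math.log(r)
    return (r + p) * stages


def p_x(c_off, c_on, c_gate):
    """Ratio of off-chip to on-chip drive time; r and p cancel in the ratio."""
    return math.log(c_off / c_gate) / math.log(c_on / c_gate)


# Capacitances in femtofarads, taken at the bounds quoted in the text.
C_GATE_FF = 10.0
C_ON_FF = 3000.0
C_OFF_FF = 15000.0

bound = p_x(C_OFF_FF, C_ON_FF, C_GATE_FF)
print(f"p_X lower bound: {bound:.2f}")

# The same ratio falls out of the full horn delays for any r and p
# (the values r = e and p = 1 here are arbitrary illustrative choices).
ratio = horn_delay(C_OFF_FF, C_GATE_FF, math.e, 1.0) / horn_delay(C_ON_FF, C_GATE_FF, math.e, 1.0)
print(f"ratio of horn delays: {ratio:.2f}")
```

Because the delay grows only logarithmically with load, a fivefold increase in capacitance raises the drive time by less than a third.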


When  inter-chip  wires  are  long  or  where  inter-chip  wiring  networks  are  heavily  loaded  with 
multiple  taps,  px  exceeds  the  lower  bound  given  in  Equation  3.14.  However,  the  top  clock  rates  for 
MCSs  other  than  instruction  broadcast  are  likely  to  be  nearer  to  the  PE  clock  rate  than  the  system 
clock  rate.  Table  3.5  summarizes  the  delay  model  terms  for  each  MCS. 

pb, the factor by which the instruction broadcast interval exceeds tpE, is a critical determiner
of I-cache speedup. The benefit of instruction caching depends also on the other px values, which
determine the times for operations using the various MCSs.

A collection of values for the MCS clock intervals comprises a p-set. A p-set is a set of five numbers
of  the  form 


p-set = {pb, pr, pi, pc, pl}

Multi-chip Subsystem     Important inter-chip wires         Wire     Factor by which minimum
                                                            delay    clock interval exceeds tpE
-----------------------  ---------------------------------  -------  --------------------------
Global instr bdcast      Global broadcast network           Wb       pb
Response                 Response network                   Wr       pr
Input/output             System data I/O network            Wi       pi
Inter-PE comm            Inter-PE comm network              Wc       pc
Local external memory    Intra-building-block connection    Wl       pl

Table 3.5: Summary of delay model terms
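A p-set can be represented as a small record from which the clock interval of each MCS follows by multiplying by tpE. The sketch below is illustrative only: the field names mirror the subscripts in Table 3.5, and the numeric values are hypothetical, chosen merely to show the calculation.

```python
from typing import Dict, NamedTuple


class PSet(NamedTuple):
    """Factors by which each MCS clock interval exceeds t_PE."""
    b: float  # global instruction broadcast
    r: float  # response network
    i: float  # system data input/output
    c: float  # inter-PE communication
    l: float  # local external memory


def clock_intervals_ns(pset: PSet, t_pe_ns: float) -> Dict[str, float]:
    """Minimum clock interval of each MCS, given the PE clock interval."""
    return {name: factor * t_pe_ns for name, factor in pset._asdict().items()}


# Hypothetical values, not taken from the text.
example = PSet(b=8, r=4, i=4, c=2, l=2)
intervals = clock_intervals_ns(example, t_pe_ns=2.0)
for name, interval in intervals.items():
    print(f"MCS {name}: {interval:.1f} ns")
```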


Due to the limited size of a manufacturable chip, a high-PE-count SIMD computer incorporates
more than one PE chip. In fact, a high-PE-count SIMD computer likely encompasses an integration
hierarchy, containing MCMs, PCBs, racks, chassis, and so on. Significant delays are incurred for
signaling across chip boundaries, due to the relatively large energies required for inter-chip signaling.
Similar, although less marked, penalties accrue crossing other boundaries in an integration hierarchy.
Wx depends in part on the level of integration hierarchy boundaries crossed by the wires in MCS X.
The following enumeration considers the likely values of the various Wx:

1. The global instruction broadcast network connects a single source (the system controller) to all
PE chips in the computer. The global instruction broadcast network wires are long, geometrically
complex, and electrically heavily loaded, so Wb is large. As pointed out in Section 3.8, for
this reason the broadcast network wires may be driven as transmission lines.

2. The response network is system-wide in extent, aggregating fan-in from each PE. Wr is therefore
likely to be large.

3. The system I/O network structure can vary over a wide range. Depending on the system,
this network may contain long wires and Wi may be large. In a specialized SIMD computer, for
example one used for CCD sensor-embedded image processing, the I/O network may incorporate
relatively short wires and so Wi may be low.

4.  To  the  extent  that  the  wires  in  the  inter-PE  communication  network  are  long,  Wc  is  large. 
Networks  of  high  topological  dimension  necessarily  contain  long  wires  [79]. 

Regular meshes contain point-to-point wires. At least one wire in a mesh inter-PE network
must cross a boundary at the coarsest level of integration in the system’s hierarchy. By folding
such a network (as described in [19](p.154)), the wires become as short as possible for wires having
to cross integration hierarchy boundaries. In this case, Wc approaches the minimum delay
exhibited by a wire crossing integration hierarchy boundaries in neighbor inter-PE communication
networks. Fast inter-PE communication on regular grids is illustrated for multiprocessors
in Mosaic [71].

5. The local external memory array is packaged alongside the PE chip within a PE building
block. The building block is a physically compact system component. If the building block is
implemented using the fastest available inter-chip technology, then local external memory wire
delay Wl approaches the minimum possible for inter-chip wiring.

As VLSI implementation technique continues to improve, inter-chip wire delays do not decrease
as fast as intra-chip circuit speeds increase. The ratio of Wx to tpE therefore tends to increase over
time, so the values in a p-set tend to increase over time.

A generic SIMD computer uses a single clock to regulate all subsystems. When Wb > tpE, PEs in
generic SIMD computers are under-utilized. Similarly, when Wb > Wx for some MCS X in a generic
SIMD computer, that MCS is under-utilized.

3.10  Alternatives  for  Maximum-rate  Instruction  Delivery 

When  the  highest  PE  clock  rate  exceeds  the  rate  of  global  instruction  broadcast,  instruction  delivery 
becomes  a  bottleneck  in  generic  SIMD  computations.  What  options  are  available  to  the  architect  to 
overcome  this  limitation? 


[Figure 3.18: tree diagram. Root: Maximum-rate Instruction Delivery. Branches: Broadcast
Instructions at PE Clock Rate; Broadcast Wide Instructions Controlling Multiple Cycles’ Activity;
Broadcast Complex Instructions for Microprogrammed PE; Locally Buffer Repeated Instructions
(SIMD Instruction Cache).]

Figure 3.18: Alternatives for Maximum-rate Instruction Delivery

Figure  3.18  illustrates  a  range  of  possible  options  for  delivering  instructions  to  the  PEs  at  the 
maximum  rate.  The  dotted  line  pointing  to  the  fast  global  instruction  broadcast  option  indicates 
that  the  option  is  not  always  available.  The  remaining  options  include  PE  microprogramming, 
wide-instruction  broadcast,  and  I-cache. 

3.10.1  PE  Microprogramming 

The inherent chip-area advantage of SIMD computers arises from consolidating replicated program
control that is redundant in some cases. The SIMD computer designer typically attempts to minimize
instruction decoding logic within the PE chip, because this logic is redundantly replicated with each
PE chip. Some decoding is sometimes unavoidable, for example to conserve the number of PE chip
pins used for receiving broadcast instructions.

One  way  to  keep  the  PE  chip  supplied  with  instructions  for  each  of  the  multiple  PE  chip  cycles 
following  receipt  of  each  broadcast  instruction  would  be  to  globally  broadcast  “complex”  instructions 
to the PEs. The PE chip local controller in this case would contain a microprogram sequencer. Each
globally  broadcast  instruction  would  dispatch  a  multi-step  microprogram  that  is  sequenced  within 
the  PE  chip. 

Many aspects of PE microprogramming are architecturally undesirable. The main drawback
of PE microprogramming applies also in microprocessors, namely that microprogramming commits
considerable chip area to a mechanism capable of sequencing a fixed complex instruction set, each
member of which may or may not be appropriate for the task at hand [65, 38]. Chip area used for
program control that could otherwise be performed off-line is better used for FU and registers where
maximum throughput is the design objective. Chip area is especially precious in high-PE-count
computers, wherein large numbers of replications raise the stakes for efficient use of PE chip area.
An additional unfortunate consequence of PE microprogramming is a degree of inflexibility that
further restricts the class of problems for which the SIMD computer is appropriate.

3.10.2 Wide-Instruction Broadcast

Another way to provide a new PE instruction on each PE clock cycle would be to globally broadcast
groups of instructions in parallel. The global instruction broadcast network would deliver many
instructions to the PE chip in parallel on each system clock cycle. The PE chip’s local controller
would select a sequence of individual PE instructions from each globally broadcast group.

Broadcasting multiple instructions per clock cycle is tantamount to broadcasting one wide
instruction. Such an approach to instruction delivery is infeasible because of the demand it places on
PE chip pins. There are not likely to be enough pins on an affordable manufacturable PE chip to
allow even 2 or 3 instructions to be received at once, let alone as many as 8 or more.

3.10.3  SIMD  Instruction  Cache 

A SIMD instruction cache is an explicitly managed instruction buffer within the PE chip. I-cache
added to the PE chip local controller comprises instruction memory and means of accessing it.
Repeated sequences of instructions, such as those appearing in loop bodies, are stored on-chip in the
I-cache for subsequent execution at the relatively high rate attainable within the confines of the PE
chip.

Compared  to  the  alternative  of  PE  microprogramming,  I-cache  is  a  flexible  means  of  defining 
“complex”  instructions  as  sequences  of  primitive  instructions  sequenced  at  the  highest  possible  rate. 

I-cache appears to be either more practical or more affordable than the alternative techniques for
overcoming the instruction delivery rate bottleneck that arises when pb > 1.

Figure 4.1 illustrates a local controller for the PE chip of an I-cached SIMD computer. Whereas the
generic SIMD local controller (illustrated in Figure 3.3) is extremely simple, I-cache introduces some
new design complexity.

[Figure 4.1: block diagram of the local controller. Input: from the global instruction broadcast
network. Output: to the local instruction broadcast network.]

Figure 4.1: Local Controller with I-cache

The multi-clock generator shown in Figure 4.1 provides all of the required clocks within the PE
chip. The fastest clock output by the multi-clock generator is the PE clock, which is pb times faster
than the system clock. The elements of a SIMD computer operate in lock-step, so all PE chip clocks
are synchronized to the system clock.

Because there are multiple clock rates, and because the number of PE clock cycles required by
an MCS to perform an operation depends in some cases on the particular operation performed, an
MCS operation may conclude on an arbitrary one of the multiple PE clock cycles which occur between
successive system clock cycles. Therefore, the format of globally broadcast instructions is augmented
for an I-cached SIMD computer so that a globally broadcast instruction may specify the index of the
PE clock cycle on which the MCS operation specified by that instruction completes.

The cache controller shown in Figure 4.1 outputs a new instruction for local broadcast within the
PE chip on every cycle of the PE clock. When a new globally broadcast instruction arrives at the PE
chip, the cache controller may provide a copy of that instruction within the chip. The cache controller
also delays instructions terminating high-latency MCS operations until the PE clock cycle specified
by the most recently globally broadcast instruction.

The cache controller also manages the cache memory. Under the direction of globally broadcast
instructions, the cache controller stores repeated instruction sequences, or cache blocks, in cache
memory and subsequently sequences them within the PE chip. With I-cache, cache-control
instructions are added to the generic SIMD computer's repertoire of globally broadcastable instructions. A
variety of I-cache designs are possible, each associated with a particular set of cache-control
instructions.
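The potential benefit of storing and replaying a cache block can be illustrated with a toy cycle count. The model below is a deliberately simplified sketch and not the evaluation model of this work: it assumes that the loop body fits in cache memory, that the first pass through the body is broadcast (and stored) at one instruction per system clock cycle, that later passes replay from cache at one instruction per PE clock cycle, and that cache-control instructions and MCS delays cost nothing. All names are hypothetical.

```python
def generic_pe_cycles(body_len, iterations, p_b):
    """PE clock cycles when every instruction arrives by global broadcast."""
    return body_len * iterations * p_b


def cached_pe_cycles(body_len, iterations, p_b):
    """PE clock cycles when the body is stored on the first pass and replayed.

    First pass: body_len instructions at p_b PE cycles each.
    Remaining passes: body_len instructions at 1 PE cycle each.
    """
    return body_len * p_b + body_len * (iterations - 1)


def speedup(body_len, iterations, p_b):
    """Ratio of generic to cached execution time under this toy model."""
    return generic_pe_cycles(body_len, iterations, p_b) / cached_pe_cycles(
        body_len, iterations, p_b
    )


# Hypothetical loop: 10 instructions, 100 iterations, p_b = 8.
print(generic_pe_cycles(10, 100, 8))
print(cached_pe_cycles(10, 100, 8))
print(f"{speedup(10, 100, 8):.2f}")
```

Under these assumptions the speedup approaches pb as the iteration count grows, and falls to 1 for a block that is executed only once.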

I-cache speedup depends on a variety of factors including the set of functions implemented in the
cache  controller,  the  number  of  instructions  that  can  be  stored  in  cache  memory,  the  programs  being 
executed,  and  the  means  by  which  the  I-cache  is  controlled  in  the  broadcast  instruction  stream. 

This chapter presents design considerations for the components shown in Figure 4.1. A family of
I-cache variants is presented, and the chip area occupied by an I-cache is estimated. These examples
quantify the displacement of payload from the PE chip ensuing from I-cache. Simplified example
programs are used to illustrate the interactions of properties of programs and cache controller
functions for members of the family of variants. Finally, issues in the programming problem of static
I-cache management are presented.


4.1  I-cache  Design  Elements 

Several  detailed  implementations  of  the  local  controller  in  an  I-cached  SIMD  computer  are  possible. 
The  design  of  the  local  controller  determines  its  chip  area.  Also,  the  local  controller  design  relative 
to the requirements of the program controlling a given computation determines what fraction of
the  factor  of  pb  maximum  execution-rate  increase  is  realized  for  the  computation.  This  section 
enumerates  the  design  elements,  identifying  those  of  greatest  concern  to  the  I-cached  SIMD  computer 
designer. 

4.1.1  Multi-clock  Generator 

The  maximum  operation  rate  of  each  subsystem  is  determined  by  the  VLSI  implementation  technique 
and  by  the  topologies  of  the  subsystem’s  inter-chip  wires.  If  the  computer  is  synchronous  and  each 
subsystem  has  a  unique  maximum  operation  rate,  then  regulating  each  subsystem  at  its  maximum 
rate  requires  a  unique  clock  per  subsystem. 

One way to distribute clocks at the various rates is to broadcast them globally. In a high-PE-count
SIMD computer, global broadcast of high-rate clocks would be performed using transmission
lines [1](Chap.8). The key design constraint for the resulting clock distribution network is the
matching of path lengths and impedances, to minimize skew and reflectance. Such a clock distribution
network forms a clock pipeline, wherein multiple clock events are propagating at any given time.
Pipelined clocking is commonly used to regulate synchronous PE arrays [24](Chap.3). Pipelined
clocking cannot provide a clock whose interval is less than the minimum inter-chip signaling interval.
Another potential problem with pipelined clocking is the high chip-area cost of a clock distribution
network containing large numbers of chips in a fanout tree.

An alternative means of obtaining the required clocks is to generate them within the PE chip. One
widely practiced means of generating on-chip clocks uses a voltage-controlled oscillator (VCO) that
is synchronized with other chips using a phase-locked loop (PLL) [12]. PLL-based clock generators
are increasingly common in microprocessors, wherein the on-chip clock rate tends to be a multiple
of the off-chip reference clock rate [5, 88]. PLLs have been described for use in synchronization
among MIMD PEs [64](Sec.3.3.1). PLLs have also been used for clock skew elimination in SIMD
computers [44, 16], although not for multiplying the system clock rate to obtain a higher PE clock
rate.

The multi-clock generator is a mundane architectural element, because PLL-based clock generators
have long been used in computers, as demonstrated in [46]. What is important about the
multi-clock generator is the chip area that it occupies. Although VCOs tend not to scale with VLSI
implementation technique as readily as does switching logic, they are practically useful for modern
microprocessors [5].

On-chip generation of a high-rate PE clock that is phase-locked to the system clock does not eliminate
the low-skew requirement on system clock distribution. However, it is less difficult to distribute
one clock signal with low skew than it is to distribute many.

The multi-clock generator design may introduce artificial constraints among the various clock
rates, as does the design described in Appendix A. The multi-clock generator described in Appendix A
provides a set of free-running clocks whose rates are constrained to be integer multiples of the PE
clock rate and integer sub-multiples of the system clock rate. An alternative multi-clock generator
might allow a clock to be stopped when the subsystem it regulates is idle, or similarly allow a clock’s
phase to be varied dynamically to minimize the delay in starting an operation on an otherwise idle
subsystem. Inevitably, the design of a multi-clock generator encompasses meeting several interacting
system timing constraints. [69] provides an excellent overview of these issues.

4.1.2  Cache  Memory 

The  cache  memory  may  have  one  or  more  ports.  With  a  two-ported  cache  memory,  one  port  is  used  to 
provide  instructions  already  stored  while  the  other  port  is  used  concurrently  to  pre-store  instructions 
that  will  be  needed  subsequently. 

Assuming that the best available memory cell with the required number of ports is used, the
only cache memory parameters of interest to the I-cached SIMD computer designer are instruction
word width (in bits) and cache size (in instructions), which together determine the number of cache
memory cells and thus the chip area occupied by the memory.

4.1.3  Cache  Management 

The  decisions  as  to  which  cache  blocks  to  place  where  in  the  cache,  as  well  as  when  to  put  them  there 
and  when  to  activate  them,  are  all  explicit  in  the  global  instruction  broadcast  stream.  For  simple 
programs,  good  choices  can  be  made  statically,  in  advance  of  running  the  computation.  Programs 
whose loop structure is complex may require these decisions to be made in the system controller
during  the  course  of  a  computation,  in  which  case  the  system  controller  requires  a  potentially  complex 
cache-management  mechanism.  Section  4.5  discusses  the  management  problem  in  detail,  illustrating 
good  solutions  for  simple  I-cache  variants. 

4.1.4  Cache-control  Protocol 

The globally broadcast instructions in an I-cached SIMD computation include cache-control instructions
in addition to the usual PE instructions. The cache-control instructions follow a cache-control
protocol to store cache blocks and to activate them. Each I-cache variant specifies a cache-control
protocol.

4.1.5  Cache  Controller 

As indicated in Figure 4.1, the cache controller is interposed between the global instruction broadcast
network and the local instruction broadcast network within the PE chip. The cache controller selects
the source of the instruction driven onto the local broadcast network on every PE chip clock cycle.
The cache controller also manages the control inputs to the cache memory, and so contains a program
counter providing a cache memory address.

The cache controller design and the concomitant cache-control protocol are the principal discriminator
among I-cache variants.

There are many possible ways to denote the cache locations occupied by a cache block. For
example, a cache block may be delimited by markers placed in the cache memory; alternatively, a
cache block may be delimited by a parameter supplied upon its activation. Loops may or may not
be unrolled when cached, subject to the details of a given cache design. A cache block does not
necessarily correspond to an entire loop or subroutine body appearing in a program; some I-cache
variants profitably cache subsequences of program bodies. A particular I-cache variant may allow
multiple entries or multiple exits for a given cache block to facilitate a compact representation.

A  globally  broadcast  cache-control  instruction  alerts  the  local  controller  to  begin  storing  a  cache 
block  at  a  specified  cache  address.  This  instruction  may  also  specify  the  length  of  that  cache  block, 
or  the  end  may  be  indicated  by  a  cache-control  instruction  transmitted  at  the  end  of  the  cache  block. 

A  cache  block  is  activated  with  a  call  specifying  the  parameters  required  for  its  execution,  possibly 
including  initial  and  final  cache  addresses  and  iteration  count.  Some  I-cache  variants  provide 
mechanisms  allowing  cache  blocks  to  activate  one  another,  or  to  nest,  in  cache  with  varying  degrees 
of  generality. 

4.2  A  Family  of  Single-Port  I-cache  Variants 

A  cache  design  is  characterized  by  answers  to  each  of  the  following  five  questions: 

•  Does  cache  memory  have  more  than  one  port? 

•  What is the maximum number of instructions that can be stored in cache memory?

•  Can  more  than  one  block  be  stored  in  cache  memory  at  a  given  time? 

•  Can  the  cache  controller  independently  iterate  cache  blocks? 

•  Do  cache  blocks  nest?  In  other  words,  can  cache  blocks  activate  one  another? 

The  I-cache  parameters  imply  the  existence  of  several  distinct  classes  of  cache  design.  There  are 
fewer  than  16  classes,  because  some  of  the  parameters  are  not  mutually  independent.  For  example, 
it  is  not  possible  for  cache  blocks  to  nest  in  a  cache  that  can  contain  just  one  block. 
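
The class count can be checked by brute force. The sketch below treats the four yes/no questions as booleans (cache size is a number, so it is left out) and applies only the one dependency named in the text, namely that nesting needs a multi-block cache. The field names are illustrative rather than taken from the design, and any further dependencies would shrink the count again.

```python
from itertools import product

# Illustrative names for the four yes/no design questions
# (the cache-size question is numeric and is not enumerated here).
QUESTIONS = ("multi_port", "multi_block", "iterates", "nests")


def is_feasible(design):
    # The one dependency stated in the text: cache blocks cannot nest
    # in a cache that holds only a single block.
    return not (design["nests"] and not design["multi_block"])


all_designs = [dict(zip(QUESTIONS, bits))
               for bits in product((False, True), repeat=len(QUESTIONS))]
feasible = [d for d in all_designs if is_feasible(d)]

print(len(all_designs), len(feasible))
```

Under that single constraint, 12 of the 16 combinations survive, which agrees with the statement that there are fewer than 16 classes.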

The last three questions in the list apply only to the design of the cache controller. To illustrate
the range of possible cache designs, consider the F-family of single-port I-cache variants.

An F-family cache memory has one port, so concurrently pre-storing cache blocks is not possible
with F-family caches. The ends of cache blocks are delimited with sentinels, so that the length of a
cache block is not specified in the block’s activation.

A member of the family is designated F_i. The six family members vary in their cache controller
characteristics, as enumerated below:

F0 is a “one-block, one-shot” cache. F0 is the simplest F-family cache. F0 is capable of containing
only a single cache block at any given time and of executing single passes through a cache
block. A cache-control instruction activating an F0 cache block supplies no parameters, because
there is only one possible starting address, the ending address is delimited explicitly in cache
memory, and the iteration count is always 1.

F1 is a “multi-block, one-shot” cache. F1 is similar to F0 in that it executes only single passes
through cache blocks. However, F1 is not as simple as F0, because F1 is capable of containing
more than one cache block at once. The questions relating to where to place each cache block
are germane for an F1 cache, giving rise to the myriad of replacement algorithm issues that
have been studied in the contexts of ordinary caches and of virtual memory management. A
cache-control instruction activating an F1 cache block supplies a single parameter specifying
the starting address of the cache block.

F2 is a “one-block, multi-shot” cache. F2 is similar to F0 in that it contains only a single cache
block. However, F2 is not so simple as F0, because F2 is capable of sequencing through a cache
block multiple times in response to a single activation. A cache-control instruction activating
an F2 cache block supplies a single parameter specifying a number of iterations of the cache
block. F2 does not require the ability to place cache-control instructions in cache.

F3 is a “multi-block, multi-shot” cache. F3 can contain multiple cache blocks, any of which can
be iterated. A cache-control instruction activating an F3 cache block supplies two parameters,
the first specifying the starting address and the second specifying a number of iterations of the
cache block.

F5 is a “multi-nestable-block, one-shot” cache. F5 is similar to F1, in that it contains multiple blocks
that are executed singly. However, F5 has the additional characteristic that cache blocks may
activate single iterations of one another. An F5 cache requires that cache-control instructions
may be placed in cache memory.

F7 is the most complex member of the F-family. An F7 cache may contain multiple blocks that
may activate one another for multiple iterations. An F7 local controller contains a scaled-down
replica of the system controller’s program-control components. Since an entire program could
be stored in an F7 cache, a PE chip with an F7 cache becomes a mini-SIMD computer in its
own right. If the individual local controllers were allowed to progress through different paths
through the program in cache, an F7-enhanced SIMD computer would become a multi-SIMD (or
MSIMD) computer [10].

Detailed designs for F0 and F2 I-cache variants are given in Appendix A.
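
The six member names are consistent with reading the subscript as three feature bits. The following sketch makes that reading explicit; the bit assignment and the function names are an interpretation of the descriptions above, not something the text states, and the activation parameters it derives are confirmed by the text only for F0 through F3.

```python
def decode(index):
    """Split an F-family index into three feature bits (assumed encoding)."""
    return {
        "multi_block": bool(index & 1),  # more than one block in cache
        "multi_shot": bool(index & 2),   # controller can iterate a block
        "nestable": bool(index & 4),     # blocks can activate one another
    }


def is_member(index):
    f = decode(index)
    # Nesting needs more than one block in cache, which rules out 4 and 6.
    return not (f["nestable"] and not f["multi_block"])


def activation_params(index):
    """Parameters carried by an activation, following the F0..F3 descriptions."""
    f = decode(index)
    params = []
    if f["multi_block"]:
        params.append("start_address")
    if f["multi_shot"]:
        params.append("iterations")
    return params


members = [i for i in range(8) if is_member(i)]
for i in members:
    print(f"F{i}", decode(i), activation_params(i))
```

Read this way, the indices 4 and 6 are absent because they would describe a nestable cache that holds only one block.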


4.3  I-Cache  Chip-Area  Estimates 

In order for I-cache to yield significant speedup for real problems, it must not displace too much
of the payload from the PE chip. γ expresses the fraction of chip area taken up by I-cache inside the
PE chip, so γ expresses the PE chip payload reduction due to I-cache. This section relates γ to the
PE chip payload estimates derived in Section 3.7. The lowest values of γ are made possible by the
most advanced VLSI implementation technique. Chip-area estimates for F0 and F2 caches, two of
the simplest members of the F-family, indicate that F0 and F2 caches are feasible for current VLSI
implementation technique.

The local controller and MCS interfaces compete with PEs for interior area I even in a generic
SIMD computer’s PE chip. Adding I-cache to the local controller further reduces the payload, assuming that
physical dimensions H and W and resolution parameter λ are fixed. The extent to which the payload
of the PE chip is reduced by I-cache depends on the chip area occupied by the multi-clock generator,
the cache controller, and the cache memory. The cache controller’s chip area depends on the set
of functions performed by the I-cache variant, while the cache memory’s chip area depends on the
width (in bits) of an instruction word and the maximum number of instructions that the memory
contains. Reducing the payload in a PE chip means reducing the number of PE bit slices in the chip
and/or reducing one or more of the PE architecture parameters (including datapath width, FU circuit
complexity, and number of registers).

Π denotes the payload of a PE chip. Estimates for three existing PE chips and one hypothetical
PE chip are summarized in Table 3.2. Let Π_g denote the payload of a generic SIMD PE chip, and
let Π_c denote the payload of the chip when I-cache is added to its local controller. I-cache forces the
local controller to expand by some chip area δ. Let A_g represent the chip area of the interior of that
generic PE chip occupied by the local controller and by MCS interfaces, and let A_c be the non-PE
chip area in the chip with I-cache. Then A_c is given in Equation 4.1:




$$A_c = A_g + \delta \tag{4.1}$$

The payloads of the PE chip without and with I-cache are given in Equations 4.2 and 4.3:

\begin{align}
\Pi_g &= I - A_g \tag{4.2} \\
\Pi_c &= I - A_c \tag{4.3} \\
      &= I - A_g - \delta \tag{4.4} \\
      &= \Pi_g - \delta \tag{4.5}
\end{align}

The increased chip area occupied by the local controller as a fraction of the generic PE chip payload
is the marginal chip area occupied by I-cache. γ represents this marginal decrease in payload, as
shown in Equation 4.6:

$$\gamma = \frac{\delta}{\Pi_g} \tag{4.6}$$

The ratio of payload with I-cache to payload without I-cache is given in Equation 4.7:

\begin{align}
\frac{\Pi_c}{\Pi_g} &= \frac{\Pi_g - \delta}{\Pi_g} \tag{4.7} \\
                    &= 1 - \gamma \tag{4.8}
\end{align}

Because δ > 0, γ > 0 and Π_c is strictly less than Π_g. It remains to consider the specific sizes of
the various local controller components needed for I-cache.

Existing on-chip clock generators serve as guides in estimating the chip area of the multi-clock
generator. A clock generator has been fabricated entirely within CMOS chips [88]. The oscillator
dominating the clock generator’s chip area runs at up to 220 MHz and is phase-locked to a lower-rate
external clock. The entire clock generator occupies 0.31 mm² in a λ = 0.4 μm process, or roughly
1.9 × 10^6 sites. Although the analog components inhibit scaling clock generator chip area as readily
as register or FU chip area, the area of this existing clock generator is useful for approximating the
area of the multi-clock generator needed for I-cache. Phase-locking is not the only way to generate
fast clocks; other means to achieve high-rate timing references include the use of synchronous delay
lines, described in [5] as occupying chip area of the same order as the PLL described in [88]. The
most general multi-clock generator derives all required clocks as sub-multiples of one high-rate
reference. Allowing a factor of two in chip area for the circuits added to the PLL to derive multiple
subsystem clocks as sub-multiples of the PE clock, a multi-clock generator would occupy less than
4 × 10^6 sites.

An F0 cache controller contains a state register with next-state logic, 3 counter-registers, two
instruction latches, a 4-input instruction selector, and a small number of other small logic blocks.
An F2 cache controller (shown in Figure A.11 to be a superset of an F0 cache controller) contains
an additional counter-register and an associated logic block. The small logic requirement for these
simple I-cache variants is very likely to be dominated by the cache memory itself in the use of chip
area for I-cache.

A typical register memory cell may be used as a conservative estimate for the chip area of the
cache memory cell. A typical CMOS register memory cell occupies 40λ × 40λ area, or 1600 sites.
Note that single-port memories, as used with an F-family cache, occupy less chip area than their
multi-port counterparts such as may be used in the PE register files. Thirty-two of these cells, as needed for
one 32-bit instruction of cache memory, occupy 51,200 sites. A conservative estimate would allow
further chip area equivalent to 20 words of memory for sense amps and bit-line drivers, or about
1 × 10^6 sites.

                                                SLAP   MP-1   Blitzen   ALAPH
payload without I-cache (Π_g) (×10^6 sites)       36    118       356    1300

100-word cache: δ ≈ 10 × 10^6 sites
payload with I-cache (Π_c) (×10^6 sites)          26    108       346    1290
payload decrease (γ) (%)                          28      8         3       1

1000-word cache: δ ≈ 55 × 10^6 sites
payload with I-cache (Π_c) (×10^6 sites)           0     63       301    1245
payload decrease (γ) (%)                         100     47        15       4

Table 4.1: Summary of Chip-Area Estimates for Simple I-Caches in Four SIMD PE Chips

The chip area occupied by one of these simple I-cache variants is obtained by adding the areas of
the multi-clock generator and the memory cell array with drivers. A resulting formula for the
area occupied by an F-family I-cache variant, where N is the number of instructions in the cache, is
given in Equation 4.9:

$$\delta = \left(5 + \frac{N}{20}\right) \times 10^{6} \text{ sites} \tag{4.9}$$


Figure 4.2 shows values of γ against N for the 4 PE chips whose chip area has been measured
above. Figure 4.2 shows that for newer VLSI process technologies with values of λ below 0.5 μm, up
to 1K words of cache memory correspond to values of γ less than 0.15.
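
The entries of Table 4.1 can be reproduced with a few lines of arithmetic. The sketch below assumes that Equation 4.9 reads δ = (5 + N/20) × 10^6 sites, which is the reading consistent with the component estimates above (about 4 × 10^6 sites for the clock generator, about 1 × 10^6 for drivers, and 51,200 sites per 32-bit word); the function names are illustrative, and the clamping for a cache that does not fit is an added convenience.

```python
# Payload without I-cache, in units of 10^6 sites (first row of Table 4.1).
PAYLOAD_G = {"SLAP": 36, "MP-1": 118, "Blitzen": 356, "ALAPH": 1300}


def icache_area(n_words):
    """Assumed reading of Equation 4.9, in units of 10^6 sites."""
    return 5 + n_words / 20


def payload_with_cache(payload_g, delta):
    # Equation 4.5, clamped at zero when the cache would not fit at all.
    return max(payload_g - delta, 0)


def gamma(payload_g, delta):
    # Equation 4.6, capped at 1 for the same reason.
    return min(delta / payload_g, 1.0)


for n in (100, 1000):
    d = icache_area(n)
    for chip, pg in PAYLOAD_G.items():
        print(n, chip, payload_with_cache(pg, d), round(100 * gamma(pg, d)))
```

With these inputs the rounded percentages match the two "payload decrease" rows of the table, including the SLAP case in which a 1000-word cache exceeds the whole payload.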

Figure 4.3 plots values of γ for each of the PE chips at a cache size of 100 instructions. Two curves
are superimposed on the set of points, the lower one ignoring the data point for the SLAP chip.

VLSI technology continues to improve over time so as to make possible larger manufacturable
chips with decreasing geometric resolution [23]. The number of sites in a chip grows linearly with the
physical area H × W and quadratically with the component density 1/λ, as shown in Equation 3.1. The
typical SIMD PE chip organization, wherein the available PE area is tiled with replicated bit slices,
readily exploits increases in the size of the PE chip. As interior area I grows, the allowable number of
PEs, number of registers per PE, and/or FU complexity increases. For example, re-implementation of
a PE chip containing a PE of fixed FU complexity at higher I yields a PE chip containing an increased
number of PEs, each with increased per-PE memory.

By contrast, the MCS interfaces and local controller occupy a number of sites that need not
change as the VLSI technology improves, so that A_c remains roughly constant. For fixed A_c, the
VLSI technology scaling effects of increasing H and W while decreasing λ take γ toward 0, so that the
value of Π_c/Π_g given in Equation 4.7 tends toward 1. The scaling of VLSI fabrication process parameters
cannot continue indefinitely, however, due to the existence of fundamental physical lower limits on
the sizes of MOS devices and their interconnections [43, 48].

The estimates for γ suggest that VLSI implementation technique has only recently reached sufficient
chip component densities so that a simple I-cache containing about 100 instructions occupies
negligible chip area in a PE chip.

Figure 4.2: γ v. Cache Size for Each of Four PE Chips

Figure 4.3: γ v. λ for Each of the Four PE Chips at N=100


4.4  Effects of Program Properties

SIMD computers are usually used to solve data-parallel problems. A data-parallel problem is divisible
into a collection of sub-problems, each of which is associated with a subset of the problem-defining
input data set [41]. One way to solve a data-parallel problem is to distribute the sub-problems to
PEs such that the sub-problems are solved concurrently. This distribution induces a requirement
for inter-PE communication that may limit the throughput of the computation. How much inter-PE
communication is required is proportional to how much the sub-problems’ data-subsets overlap. The
degree of overlap varies among data-parallel problems. For a given data-parallel problem, distributing
the sub-problems so as to minimize the amount of required inter-PE communication is important
for achieving efficient computation [50, 51, 84].

High-level data-parallel programming languages often abstract the details of data set sizes and
PE counts, allowing such values to be specified at compile time, or even at run time [18]. When
there are far fewer PEs in the computer than there are sub-problems to solve, the computation
consists of inner loops iterated many times. It is just this sort of computation, with many repeats of
instruction sequences, for which I-cache should be most effective. On the other hand, if not many of
the instructions are repeats, then I-cache cannot be very useful in speeding up the computation.

The observations that inter-PE communication frequency and loop iteration counts affect I-cache
speedups raise the question, what are the properties of programs that should affect I-cache speedup?
How are these properties assessed in estimating the I-cache speedup for an arbitrary program?

This  section  enumerates  characteristics  of  programs  and  analyzes  how  each  affects  I-cache 
speedup. 

4.4.1  Proportion  of  Repeat  Instructions 

The  most  important  property  of  a  program  with  respect  to  I-cache  speedup  is  how  many  instructions 
are  repeats.  If  no  instructions  are  repeated,  then  I-cache  is  of  no  value,  whereas  if  most  instructions 
are  repeats,  then  I-cache  is  of  maximum  value.  Loop  iteration  counts  express  the  degree  of  instruction 
repetition. 

Consider the following program, simple:

    program simple;
        B
        for j := 1 to J do
            A
        end;
    end simple;

The symbols A and B in program simple denote sequences of instructions. Assuming that A is a
cachable instruction sequence, the problems of determining which sequence to store in cache, when
to store it, and when to activate it have obvious solutions.

Where the length of sequence A is denoted A and the length of sequence B is denoted B, the
number of cycles of the system clock to run program simple is given as

$$\text{time for simple} = B + JA \text{ cycles}$$

F0 is the simplest member of the F-family of I-cache variants. F0 is capable of storing only one
cache block at a time and executing only single passes through the stored block. The following
skeleton for program simple_cache illustrates how program simple is modified to use an F0
I-cache:

    program simple_cache;
        B
        store sequence A in cache
        for j := 1 to J do
            activate cached sequence A
        end;
    end simple_cache;

Ideally, each pass through cached sequence A in simple_cache is p_b times faster than each
corresponding iteration of the inner loop in simple. However, storing A in cache is an extra pass
through that sequence of instructions in simple_cache that has no counterpart in simple.

Assuming that instruction sequence A is cachable and that it runs faster from cache by the factor
p_b, then the time for the modified program to run with I-cache is given as

$$\text{time for simple\_cache} = B + A + J\frac{A}{p_b} \text{ cycles} \tag{4.10}$$

The time for program simple_cache reflects the time to execute the instructions in sequence B
in addition to the time to load A into the cache and subsequently execute it from there. The time
for B represents the impact of un-cachable instructions on execution with I-cache, while the extra
pass through A to store it into cache memory represents the run-time overhead of using I-cache. The
I-cache speedup is given as the ratio of the two execution times:

$$\text{speedup for simple} = \frac{B + JA}{B + A + J\frac{A}{p_b}} \tag{4.11}$$

Figure 4.4: Ideal I-Cache Speedup for Program simple

Equation 4.11 suggests that the speedup for program simple approaches p_b as J approaches
∞, irrespective of the fraction B/A. However, if that fraction is large, so that the great majority of a
program’s instructions cannot be delivered from I-cache, then the speedup is significantly less than
p_b. The impact of un-cachable instructions is greatest when J is low, as illustrated in Figure 4.4.
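
As a numerical check on this behaviour, the following sketch evaluates the two run-time expressions and their ratio. The values A = 40, B = 40 and p_b = 4 are arbitrary illustrations, not figures from the text, and the function names are hypothetical.

```python
def time_simple(A, B, J):
    """System-clock cycles without I-cache: B + J*A."""
    return B + J * A


def time_simple_cache(A, B, J, p_b):
    """Equation 4.10: B, one storing pass through A, then J fast passes."""
    return B + A + J * A / p_b


def speedup_simple(A, B, J, p_b):
    """Equation 4.11: ratio of the two execution times."""
    return time_simple(A, B, J) / time_simple_cache(A, B, J, p_b)


# Illustrative parameters only.
for J in (1, 16, 1024, 10**6):
    print(J, round(speedup_simple(A=40, B=40, J=J, p_b=4), 3))
```

For a single iteration the storing pass makes the cached program slower than the original, while for very large J the ratio climbs toward p_b.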

4.4.2  Quantization 

Consider the speedup for simple when B=0, that is, when all of the program’s instructions are
cachable. Equation 4.11 becomes

$$\text{speedup for simple when } B=0 \;=\; \frac{JA}{A + J\frac{A}{p_b}} = \frac{p_b J}{p_b + J} \tag{4.12}$$

It is interesting to note that the length of sequence A cancels in Equation 4.12, so that the
speedup is independent of the length of that sequence. Unfortunately, this simple expression is not
entirely accurate. The time taken for a single pass through a cache block is quantized to an integer
number of system clock cycles: when execution of a single iteration of a cache block completes in an
F0 I-cache, activity halts pending receipt of the next globally broadcast instruction. For an I-cache
variant incapable of iterating cache blocks, this waiting time is spent after every pass through the
block. Equation 4.13 gives the more accurate time to execute simple_cache using an F0
I-cache, and Equation 4.14 gives the corresponding speedup when B=0:

$$\text{time for simple\_cache with quantization} = B + A + J\left\lceil\frac{A}{p_b}\right\rceil \text{ cycles} \tag{4.13}$$

$$\text{speedup for simple when } B=0 \text{ with quantization} = \frac{JA}{A + J\left\lceil\frac{A}{p_b}\right\rceil} \tag{4.14}$$

Figure 4.5 shows how quantization affects I-cache speedup when B=0. The speedup impact of
quantization is most pronounced for short loop bodies and for high iteration counts. Note that if A
is a multiple of p_b, then quantization does not reduce I-cache speedup. The “steps” in Figure 4.5
occur at values of p_b where the quantized ratio ⌈A/p_b⌉ decreases. It is apparent in Figure 4.5 that for
fixed J, speedup does not increase simply with the length of the cache block as one might expect.
Quantization causes the speedup curves to cross over one another anomalously at some values of p_b.
Higher values of A tend to smooth out the effect of quantization.
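
The effect of the ceiling can be seen by evaluating the ideal and quantized speedups side by side. This is a small sketch with B = 0, J = 16 and p_b = 4 chosen for illustration; the block lengths are likewise arbitrary, and the function names are hypothetical.

```python
from math import ceil


def speedup_ideal(J, p_b):
    """Equation 4.12 (B = 0, no quantization)."""
    return p_b * J / (p_b + J)


def speedup_quantized(A, J, p_b):
    """Equation 4.14 (B = 0): each cached pass costs ceil(A / p_b) system cycles."""
    return J * A / (A + J * ceil(A / p_b))


# Illustrative block lengths; 8 is a multiple of p_b = 4, the others are not.
for A in (8, 9, 65, 117):
    print(A, round(speedup_ideal(16, 4), 3), round(speedup_quantized(A, 16, 4), 3))
```

A block of length 8 reaches the ideal value, whereas lengthening it to 9 lowers the speedup, which is the non-monotonic behaviour noted for fixed J.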


4.4.3  Loop  Structure 

The objective of I-cache is to deliver repeated instructions at the highest rate from a repository within
the PE chip. The actual speedup depends in part on the way in which instruction sequences occur
in a program, and on how they are repeated. In many computations solving data-parallel problems,
most of the time is spent executing instructions that reside in the bodies of inner loops. I-cache
should work well for such programs. For one-block-at-a-time I-cache variants, including F0 and F2,
loop bodies whose executions alternate in time displace each others’ cache blocks. This alternation
gives rise to a form of thrashing. Thrashing occurs also for multi-block I-cache variants when the
capacity of cache memory is exceeded.

Having to re-store cache blocks reduces the I-cache speedup. To illustrate this effect, consider the
program thrash:

    program thrash;
        for i := 1 to I do
            for j := 1 to J do
                A
            end;
            for j := 1 to K do
                B
            end;
        end;
    end thrash;

Figure 4.5: I-Cache Speedup for Program simple with Quantization, when B=0 (curves: A=9, J=1024; A=117, J=16; A=65, J=16)

Where A and B are the lengths of instruction sequences A and B, respectively, then the runtime
of thrash is given in Equation 4.15:

$$\text{runtime for thrash} = I(JA + KB) \text{ cycles} \tag{4.15}$$

Clearly, if both instruction sequences could be accommodated in cache memory at once, as with a
suitably large F1 I-cache, then the instruction sequences would not compete, as in thrash_fone:

    program thrash_fone;
        store sequence A in cache
        store sequence B in cache
        for i := 1 to I do
            for j := 1 to J do
                activate cached sequence A
            end;
            for j := 1 to K do
                activate cached sequence B
            end;
        end;
    end thrash_fone;

The runtime and speedup for thrash_fone are given below:

$$\text{runtime for thrash\_fone} = A + B + I\left(J\frac{A}{p_b} + K\frac{B}{p_b}\right) \text{ cycles}$$

$$\text{F1 speedup for thrash, ignoring quantization} = \frac{I(JA + KB)}{A + B + I\left(J\frac{A}{p_b} + K\frac{B}{p_b}\right)}$$

F0 is a one-block I-cache, capable of containing only one cache block at a time. Therefore, the
cache blocks in thrash replace one another in an F0 I-cache, as shown in thrash_fzero:

    program thrash_fzero;
        for i := 1 to I do
            store sequence A in cache
            for j := 1 to J do
                activate cached sequence A
            end;
            store sequence B in cache
            for j := 1 to K do
                activate cached sequence B
            end;
        end;
    end thrash_fzero;


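The runtime and speedup for thrash_fzero are given below:

$$\text{runtime for thrash\_fzero} = I\left(A + J\frac{A}{p_b} + B + K\frac{B}{p_b}\right) \text{ cycles}$$

$$\text{F0 speedup for thrash, ignoring quantization} = \frac{I(JA + KB)}{I\left(A + J\frac{A}{p_b} + B + K\frac{B}{p_b}\right)} = \frac{JA + KB}{A + J\frac{A}{p_b} + B + K\frac{B}{p_b}}$$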


The thrashing of the cache blocks in thrash-fzero increases the runtime by $(I - 1)(A + B)$ cycles
relative to the F1 case. Figure 4.6 illustrates this difference.
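As a quick numerical check of the thrash runtimes, the following Python sketch evaluates the generic, F1, and F0 cases and the resulting speedups. It ignores quantization throughout, and the parameter values (I = 10, J = K = 4, A = B = 40, and a broadcast ratio of 4) are arbitrary illustrative assumptions, not values taken from Figure 4.6.

```python
def thrash_runtimes(I, J, K, A, B, rho_b):
    """Return (generic, f1, f0) runtimes in system clock cycles, ignoring quantization."""
    generic = I * (J * A + K * B)                 # every instruction is globally broadcast
    cached_body = J * A / rho_b + K * B / rho_b   # one outer iteration run from cache
    f1 = A + B + I * cached_body                  # both blocks stored once
    f0 = I * (A + B + cached_body)                # both blocks re-stored on every outer iteration
    return generic, f1, f0


generic, f1, f0 = thrash_runtimes(I=10, J=4, K=4, A=40, B=40, rho_b=4)
print(round(generic / f1, 2), generic / f0)  # 3.64 2.0
print(f0 - f1)                               # 720.0, which equals (I - 1) * (A + B)
```

The last line reproduces the thrashing penalty stated in the text: the one-block cache pays the store cost A + B on each of the I outer iterations instead of once.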


4.4.4  MCS-Intensiveness 

During each PE clock cycle in an I-cached SIMD computation, some subsystem is busy, be it the
PE FUs, one of the MCSs, the system controller, or some combination of these. When the computation
cannot progress pending completion of an operation by a given subsystem, the computation is
subsystem-bound at that point. For example, when inter-PE communication is in progress and there
are no other operations to be performed that do not depend on the inter-PE communication result,
then the computation at that point is inter-PE-communication-bound.

The  likelihood  of  a  computation  becoming  subsystem-bound  by  a  given  subsystem  corresponds 
to  the  intensiveness  of  use  of  that  subsystem  in  the  program  controlling  the  computation.  One 
aspect of MCS-intensiveness is measured by the calc-to-comm ratio, the ratio of the number of
calculation  operations  to  the  number  of  inter-PE  communication  operations  occurring  in  a  program. 
The  calc-to-comm  ratio  expresses  a  program’s  intensiveness  of  inter-PE  communication  relative  to 
FU  calculation. 

For example, consider the program permuter with the following structure:

program permuter;

for j := 1 to J do
    Y ; perform inter-PE communication operation
    A
end;

end permuter;

Figure 4.6: F1 and F0 Speedups for thrash (assuming A = B = 40 and … and ignoring quantization)

The inner loop of permuter contains an inter-PE communication operation (labeled Y) followed
by an instruction sequence A. There are conditions under which I-cache cannot speed up a program
like permuter. Specifically, for simple I-cache variants including F0 and F1 that are not capable of
iterating stored instruction sequences, there is no I-cache speedup when the following two conditions
obtain:

1. Instruction Y is flow-independent of the instructions in the sequence A. That is, Y may take
place concurrently with any of the operations specified in sequence A.

2. The duration of instruction Y is at least as great as the time to globally broadcast all of the
instructions in A.

The time to execute the loop body cannot be less than the time to execute instruction Y, which is
determined by the duration of the inter-PE communication operation. Under these two conditions,
delivering the instructions from A at a higher rate will not reduce the time needed to execute the
loop body of permuter. Note, however, that these two conditions are highly restrictive. If there is
even a single instruction in A that is flow-dependent on Y and thus cannot overlap Y, then I-cache
may speed up the loop body. It is likely that there will be such a flow-dependence, because the result
of the inter-PE communication specified in Y is likely to be used by another instruction in the loop
body.

An  additional  condition  must  be  met  for  an  instruction  sequence  to  exhibit  no  speedup  for  I-cache 
variants,  including  F2,  that  are  capable  of  iterating  stored  instruction  sequences: 

3. The duration of instruction Y is divisible by $\rho_b$.

Even if the first two conditions apply, if condition (3) does not apply, then an F2 or higher F-family
I-cache variant can begin executing the next iteration of the loop body without waiting for the
ensuing globally broadcast instruction. For this reason, there may be some speedup unless condition
(3) applies.

This discussion shows that I-cache speedup depends strongly on the particular operations performed
in a program. I-cache speedup depends on the specific latencies of FU and MCS operations,
the degree to which such operations may overlap (as constrained by flow-dependencies), and the
ability of the local controller to iterate cache blocks without assistance from the system controller.
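The first two conditions can be illustrated with a deliberately simplified timing model, sketched below in Python. The model is an assumption made for illustration: it treats one loop body as lasting as long as the slower of the communication operation Y and the delivery of sequence A, it measures time in PE clock cycles, and it ignores the slot occupied by Y itself and the cost of storing the cache block.

```python
def loop_body_time(y_duration, a_length, slot):
    """Lower bound on one loop body, in PE clock cycles.

    Assumes Y overlaps every instruction in A (condition 1), so the body
    lasts as long as the slower of Y and the delivery of A.
    """
    return max(y_duration, a_length * slot)


def loop_body_speedup(y_duration, a_length, rho_b):
    """Ratio of broadcast-delivered to cache-delivered loop body time."""
    broadcast = loop_body_time(y_duration, a_length, slot=rho_b)  # one instruction per rho_b PE cycles
    cached = loop_body_time(y_duration, a_length, slot=1)         # one instruction per PE cycle
    return broadcast / cached


# Condition 2 holds: Y lasts as long as broadcasting A (10 instructions * 4 cycles).
print(loop_body_speedup(y_duration=40, a_length=10, rho_b=4))  # 1.0

# Condition 2 fails: Y is shorter than the broadcast of A, so faster delivery helps.
print(loop_body_speedup(y_duration=20, a_length=10, rho_b=4))  # 2.0
```

In the first call the loop body is bound by the communication operation under both delivery schemes, so caching gains nothing; in the second the body is bound by instruction delivery when broadcast, and caching halves its duration.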

At  the  outset  of  a  computation,  before  any  instruction  has  been  broadcast,  the  computation  is 
global-instruction-broadcast-bound. Indeed, whenever no long-duration operation is outstanding
and  there  is  an  operation  whose  operands  are  ready  in  PE  registers  and  whose  subsystem  is  idle, 
then  the  computation  is  instruction-delivery-bound.  One  way  of  viewing  the  function  of  I-cache  is  as 
minimizing  global-instruction-broadcast-boundedness. 

4.4.5  Relative  Subsystem  Clock  Rates 

The  severity  of  subsystem-boundedness  depends  not  only  on  the  MCS-intensiveness  of  the  program 
controlling  the  computation,  but  also  on  the  electrical  characteristics  of  the  subsystems  carrying  out 
those  operations.  Subsystem  electrical  characteristics  determine  the  time  required  for  the  subsystem 
to  perform  one  step  of  an  operation.  For  example,  if  local  external  memory  is  located  physically  close 
to  the  PE  chip  (shown  together  in  Figure  3.2),  then  signaling  through  the  local  external  memory 
subsystem may occur at the maximum inter-chip rate. The local external memory subsystem clock
rate would be high in this case. As an example at the other extreme, router-to-router communication
steps in an inter-PE communication network of high topological complexity might occur at as low a
rate  as  that  of  global  instruction  broadcast.  The  inter-PE  communication  subsystem  clock  rate  would 
be  low  in  this  case. 

Recall (from Section 3.9) that a $\rho$-set characterizes the relative clock rates of the MCSs. A $\rho$-set
of the form {N, N, N, N, N}, wherein the PE clock rate is N times faster than the clock rate of every
MCS, characterizes a SIMD computer wherein all MCSs operate at the low rate of global instruction
broadcast. For a given program, subsystem-boundedness is most likely when all MCS clock rates are
low, because MCS operation durations are maximum. A $\rho$-set of the form {N, 1, 1, 1, 1} characterizes
a computer wherein all MCSs other than the global instruction broadcast subsystem operate at the
high PE clock rate. These two $\rho$-sets represent the extremely slow and the extremely fast
possibilities, respectively, for the clock rates of MCSs other than global instruction broadcast. For
a given program, subsystem-boundedness is least likely when all MCS clock rates are high, because
MCS operation durations are minimum.

4.4.6  Problem  Size  and  PE  Count 

Problem size and machine size are two of the most important parameters of computation. Surprisingly,
the speedup due to I-cache is largely independent of the data set size and the PE count.

To the extent that increased PE count decreases the operation rate of MCSs or increases the
stepcount of MCS operations (for example, inter-PE communication, I/O, or response), increased PE
count increases $\rho$-set values. Increased $\rho$-set values increase the time spent per MCS operation and
thereby decrease I-cache speedup.

Similarly,  increasing  the  problem  size  increases  the  amount  of  problem  data  stored  by  each  PE. 
When  the  data  storage  requirement  exceeds  the  PE’s  register  file,  some  data  is  accessed  in  external 
memory  at  a  lower  rate  than  in  the  register  file.  In  this  way,  increasing  the  problem  size  may  decrease 
I-cache  speedup. 

As shown in Figure 4.4, I-cache speedup is extremely sensitive to iteration count. To a first
approximation, iteration counts are proportional to the ratio of data set size to PE count. I-cache
speedup is more sensitive to the ratio between data set size and PE count than it is to the absolute
size of either of these parameters.

4.4.7 Data-dependence

Data-dependent  instruction  execution  occurs  in  a  program  when  the  sequence  of  instructions  is 
selected  by  values  derived  from  the  input  data  set.  There  are  two  types  of  data-dependence  in 
data-parallel  programs:  local  and  global  data-dependence.  Local  data-dependence  occurs  when  the 
sequence  of  instructions  to  be  executed  by  the  PE  depends  on  the  value  of  intermediate  data  that 
is local to the PE. For example, an IF-THEN program construct specifies locally data-dependent
instruction execution. Global data-dependence corresponds to conditional branching performed by
the program-control portion of the system controller. Global data-dependence depends on aggregate
information about the input data set. For example, a WHILE loop executed until all PEs converge to
some final state exhibits global data-dependence.

Surprisingly,  the  consequences  of  these  two  types  of  data-dependence  for  I-cache  speedup  differ 
markedly.  While  global  data-dependence  tends  to  lower  I-cache  speedup,  a  high  degree  of  local 
data-dependence  actually  favors  large  I-cache  speedups. 

As  an  example  of  local  data-dependence,  consider  a  simple  program  local-cond,  in  whose  loop 
body  the  PE  performs  one  of  two  sets  of  actions,  depending  on  the  value  of  the  PE  variable  x: 

program local-cond;

for i := 1 to I do
    enable memory writes only where (x = 2)
    A
    enable memory writes only where (x != 2)
    B
end;

end local-cond;

In  local-cond,  the  PE  should  execute  either  sequence  of  instructions  A  or  B,  but  not  both, 
depending  on  the  value  of  x  in  the  PE.  Because  the  PEs  in  a  SIMD  computer  share  a  single  instruction 
stream,  both  sequences  A  and  B  are  broadcast  to  the  PEs.  Context  management  instructions  direct 
the  context  manager  (shown  in  Figure  3.4)  to  suppress  writes  to  register  memory  within  PEs  to 
whose data the ensuing instruction sequence does not apply. Assuming that a context management
operation is specified in a single instruction, the number of system clock cycles needed to execute
local-cond in a generic SIMD computation is given below:

$$\text{runtime for local-cond} = I\,(1 + A + 1 + B)\ \text{cycles}$$

If  A  and  B  both  contain  cachable  instructions,  then  the  entire  loop  body  can  be  placed  in  cache, 
resulting  in  the  following  program  structure: 

program local-cond-cache;

store loop body in cache
for i := 1 to I do
    activate cached loop body
end;

end local-cond-cache;

The runtime and speedup for local-cond-cache are given below:

$$\text{runtime for local-cond-cache} = 1 + A + 1 + B + I\left(\frac{1 + A}{\rho_b} + \frac{1 + B}{\rho_b}\right)\ \text{cycles}$$

$$\text{ideal I-cache speedup for local-cond-cache} = \frac{I\,(1 + A + 1 + B)}{1 + A + 1 + B + I\left(\frac{1 + A}{\rho_b} + \frac{1 + B}{\rho_b}\right)}$$

Substituting the constant C = 1 + A + 1 + B into this equation gives the speedup formula:

$$\text{ideal I-cache speedup for local-cond-cache} = \frac{I\,C}{C + \frac{C\,I}{\rho_b}} = \frac{\rho_b\,I}{\rho_b + I}$$

which is the simple speedup formula first introduced in Equation 2.2. Here, though, multiple
instruction sequences are concatenated, separated by context management instructions, to form
large cache blocks. Longer cache blocks are less susceptible to the quantization effect. To the extent
that local data-dependence increases the lengths of instruction sequences appearing in loops, local
data-dependence increases the I-cache speedup.
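The substitution step for local-cond-cache can be checked numerically. The Python sketch below evaluates the ideal speedup directly from the sequence lengths and compares it with the simplified form in the broadcast ratio and the iteration count alone. The function names and the sample values are illustrative assumptions, and quantization is ignored.

```python
def ideal_speedup(I, A, B, rho_b):
    """Ideal I-cache speedup for local-cond-cache, ignoring quantization."""
    C = 1 + A + 1 + B              # loop body length, including two context management instructions
    generic = I * C                # every iteration globally broadcast
    cached = C + I * C / rho_b     # store the body once, then run it I times from cache
    return generic / cached


def simple_speedup(I, rho_b):
    """The same speedup after cancelling C from numerator and denominator."""
    return rho_b * I / (rho_b + I)


print(ideal_speedup(I=12, A=7, B=5, rho_b=4))    # 3.0
print(ideal_speedup(I=12, A=70, B=50, rho_b=4))  # 3.0, since C cancels out
print(simple_speedup(I=12, rho_b=4))             # 3.0
```

Because C cancels, the ideal speedup depends only on the iteration count and the broadcast ratio; block length matters only once quantization is taken into account.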

This effect is not surprising, for the following reason: SIMD computer architecture is motivated
by a desire to remove redundant program-control components from the PE. Some amount of program
control is appropriate for executing locally data-dependent programs, because the sequence of
executed instructions depends on local conditions. I-cache can be seen as a way of putting a small
amount of program-control logic back into the PE chip along with the PEs. It makes sense that the
more independent program control a program requires, the better I-cache performs, within limits.
The trick, of course, with respect to making the best use of available chip area, is to have just enough
program control in the PE chip.

Global data-dependence has a very different impact on I-cache speedup. One reason is that global
branches depend on conditions which require input from all of the PEs. The response network,
usually used to query the state of the PEs, is typically a low-clock-rate MCS. Therefore, a program
with frequent global branches is likely to be response-bound. Another reason that global
data-dependence acts to lower I-cache speedup is that response instructions cannot be placed in cache.
Whenever a response instruction is executed, global information regarding the ensuing instructions
to be executed is required, which prevents the local controller from forging ahead inside the PE chip.
Finally, if the global branch target instructions are not in cache, they need to be stored there through
the slow global instruction broadcast network.

The  difference  between  local  and  global  data-dependence  is  summed  up  by  observing  the  sequence 
of instructions delivered to the PEs. Although both kinds of data-dependence affect the instruction
sequence  executed  by  PEs,  locally  data-dependent  instruction  sequences  are  delivered  to  the  PEs 
obliviously  of  how  they  are  executed.  In  other  words,  the  context  management  instructions  determine 
what  actually  happens  within  the  PE,  but  the  sequence  of  instructions  that  is  delivered  to  the  PEs 
does  not  depend  on  the  values  of  intermediate  data.  Not  so  global  data-dependence.  A  globally 
data-dependent condition selects one of multiple candidate instruction sequences for delivery to the
PEs.  Local  data-dependence  tends  to  lengthen  sequences  of  cachable  instructions,  increasing  I-cache 
speedup,  whereas  global  data-dependence  tends  to  curtail  cachable  sequences,  decreasing  I-cache 
speedup. 

4.5  SIMD  Instruction  Cache  Management 

An  I-cached  SIMD  computation  embodies  two  concurrent  control  threads:  one  running  on  the  system 
controller,  and  another  running  replicatedly  on  the  local  controller  in  each  PE  chip.  This  control 
concurrency represents a departure from generic SIMD computation, wherein a single program
running  on  the  system  controller  specifies  on  every  clock  cycle  both  the  activity  within  the  system 
controller  and  the  instruction  delivered  to  the  PEs. 

The system controller maintains the program’s principal control sequence. The thread running
on the local controller is not always active; when the needed instructions are not present in cache,
the local controller is said to be locked to the global instruction broadcast stream. Any activity that
occurs within the PE chip while the local controller is locked occurs at the instruction broadcast rate.
When the local controller is locked, either the PEs idle or they execute whatever PE instructions
are broadcast from the system controller, depending on the cache design. In any event, the cache
controller itself operates continually under control of globally broadcast instructions. When a cache
block has been fully stored in the cache, the system controller is able to broadcast a fork activating
that cache block. When the local controller begins executing that cache block, the cache controller
sequences through the cache block at the PE clock rate. When cache block execution terminates, the
local controller re-locks to the globally broadcast instruction stream.

A program counter in the cache controller advances at a rate different from that of one in the
system  controller.  Since  the  relative  rates  of  advance  are  fixed  and  known  statically,  it  is  possible  for 
the  system  controller  to  maintain  an  accurate  model  of  the  state  of  the  cache  controller.  This  model 
is  used  in  system  controller  cache  management. 

I-cache  management  is  a  programming  problem  of  assigning  repeated  instructions  to  cache  blocks, 
causing  cache  blocks  to  be  stored,  and  causing  them  to  be  activated.  Static  management  of  ordinary 
direct-mapped  instruction  caches  is  used  to  minimize  conflict  misses  among  frequently  executed 
instructions  [57].  This  section  illustrates  the  sub-problems  of  I-cache  management  by  showing 
examples of statically managing F-family I-cache variants. For static cache management, the
sub-problems are solved at compile time. The illustrations presented here are representative also of
dynamic  I-cache  management  actions;  although  the  program  transformations  are  applied  during 
the  computation  under  dynamic  I-cache  management  rather  than  beforehand,  the  modifications 
themselves  are  the  same  as  for  static  I-cache  management. 

4.5.1  Step  1:  Identify  Cachable  Instructions 

An assembly language program specifies a sequence of operations that makes up a SIMD computation.
That assembly language program is translated into a machine code program which is stored in
system controller instruction memory prior to the computation. The assembly language instructions
specify system controller operations as well as PE FU and MCS operations. The instructions that
are globally broadcast during a generic SIMD computation, by contrast, are PE machine code
instructions that do not specify program-control operations. For I-cached SIMD computation, a limited
set of program-control instructions forming a cache-control protocol is added to the set of globally
broadcastable instructions.

The goal of I-cache is to place all globally broadcast instructions that are repeated in cache
and subsequently deliver them to the PEs from cache memory at a high rate. The machine code
instructions corresponding to some assembly language program instructions cannot be placed in
cache, because a given I-cache variant may lack the facilities to perform the associated
program-control operations.

For example, instructions that alter global control flow are un-cachable with an F0 I-cache variant.
Only  a  restricted  subset  of  program-control  operations,  those  controlling  fixed-iteration-count  loop 
iteration,  are  cachable  with  an  F2  I-cache  variant. 

Some assembly language program instructions specify system controller indexer subsystem
operations. These operations calculate loop-index-dependent values and are used to form literals
for  global  broadcast  to  the  PEs  or  to  form  system  data  memory  addresses.  The  F-family  contains 
simple  I-cache  variants  that  do  not  include  indexer  subsystems,  so  the  machine  code  instructions 
corresponding  to  such  assembly  language  instructions  are  un-cachable  for  F-family  caches. 

A basic block is a sequence of instructions containing no conditional branching and no branch
targets [29](p.478); the instructions in a basic block are executed in a group irrespective of
problem-instance data. For SIMD programs, PE context management instructions delimit basic blocks with
respect to code re-ordering because they delimit the boundaries of conditionally executed instruction
sequences. However, context management instructions do not affect cachability, because context
management operations restrict the side-effects of instruction execution rather than the order in
which subsequent instructions are executed.

The  fact  that  some  instructions  other  than  conditional  branches  are  un-cachable  for  simple  I- 
cache  variants  means  that  caching  is  restricted  to  sub-sequences  of  basic  blocks.  An  un-cachable 
instruction  has  the  greatest  negative  impact  for  I-cache  variants  that  are  capable  of  iterating  cached 
loops:  an  un-cachable  instruction  in  a  loop  body  prevents  iterating  the  loop  body  in  cache. 

It is straightforward to identify cachable instructions from an assembly language program for a
given  I-cache  variant.  All  of  the  resulting  machine  code  instructions  are  cachable,  with  the  exception 
of  those  corresponding  to  assembly  language  instructions  that  specify  program-control  functions  that 
the  I-cache  variant  is  incapable  of  performing. 
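The identification step can be sketched as a filter that splits a program into maximal runs of cachable instructions. In the Python sketch below, the instruction kinds (`pe`, `context`, `loop_control`, `global_branch`, `indexer`, `response`), the per-variant tables, and the sample program are all hypothetical names chosen for illustration; they encode only the restrictions stated in the text and are not the thesis's instruction set.

```python
from itertools import groupby

# Hypothetical instruction kinds that each variant cannot execute from cache.
UNCACHABLE_KINDS = {
    "F0": {"global_branch", "loop_control", "indexer", "response"},
    "F2": {"global_branch", "indexer", "response"},  # F2 can iterate fixed-count loops
}


def cachable_runs(program, variant):
    """Split (kind, text) pairs into maximal runs of cachable instructions."""
    blocked = UNCACHABLE_KINDS[variant]
    runs = []
    for is_cachable, group in groupby(program, key=lambda ins: ins[0] not in blocked):
        if is_cachable:
            runs.append([text for _, text in group])
    return runs


program = [
    ("pe", "load r1"),
    ("context", "where x = 2"),      # context management does not break a run
    ("pe", "add r1, r2"),
    ("indexer", "next address"),     # un-cachable for every F-family variant
    ("pe", "store r1"),
    ("loop_control", "repeat 8"),    # cachable only where the variant can iterate
    ("pe", "mul r3, r4"),
]

print(cachable_runs(program, "F0"))
print(cachable_runs(program, "F2"))
```

Under these assumptions the F0 variant yields three candidate runs, while the F2 variant merges the last two because it can execute the fixed-iteration-count loop control from cache.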

4.5.2  Step  2:  Determine  Which  Sequences  Become  Cache  Blocks 

The objects that go into a SIMD instruction cache are sequences of instructions, rather than
individual instructions. The reason is a consequence of the explicit management of I-cache: at least one
extra cache-control instruction is required to place an instruction sequence in cache, and at least
one cache-control instruction is required to activate a cache block. There is no benefit to caching
an individual instruction (unless it appears alone in an iterated loop body, and then only for an F2
or higher I-cache variant capable of iterating cache blocks). This restriction further differentiates
SIMD instruction cache from typical instruction caches, wherein it is profitable to store individual
instructions. (To exploit spatial locality in memory references, fixed-size groups of instructions are
stored collectively as lines in ordinary caches [75](p.477). However, the use of multiple-instruction
lines is orthogonal to the exploitation of temporal locality that is the fundamental motivation behind
caching [85].)

A  cachable  instruction  sequence  that  does  not  become  a  cache  block  is  said  to  be  excluded  from 
cache.  Not  every  instruction  sequence  that  is  a  candidate  for  caching  can  be  beneficially  executed 
from  cache.  There  are  two  possible  reasons  for  excluding  a  candidate  instruction  sequence:  placing 
the  instruction  sequence  in  cache  may  not  speed  up  the  execution  of  that  sequence,  or  the  instruction 
sequence  may  compete  for  cache  space  with  other,  more  profitably  cached  alternative  sequences. 

•  Exclusion  due  to  ineffectiveness: 

Some instruction sequences take no less time when executed from cache than when globally
broadcast.  Such  is  the  case,  for  example,  for  non-iterated  single-instruction  sequences.  As 
another  example,  caching  may  not  speed  up  an  instruction  sequence  that  is  subsystem-bound. 
Because  storing  a  cache  block  represents  a  time  overhead  proportional  to  the  length  of  the 
sequence,  caching  an  instruction  sequence  for  which  there  is  no  cache  speedup  actually  slows 
down  the  computation. 

A single instruction that is not the body of a one-instruction iterated cache block cannot be
executed faster from cache, irrespective of $\rho_b$ and of the number of times that particular instruction
is used. Such a cache block is excluded from an F0 cache. An F2 cache is capable of iterating
cache blocks, so single-instruction sequences are not necessarily excluded from an F2 cache.

The possibility of ineffectiveness makes it important to estimate statically the speedup from
caching a given instruction sequence. Such estimation is discussed in Section B.4.5.

•  Exclusion  due  to  competition: 

Because F0 and F2 I-cache variants contain only one cache block at a time, all cachable instruction
sequences compete for cache space with each other: whenever a block is cached in an F0 or
F2 cache, it displaces the previously cached block.

Competition for cache space occurs among instruction sequences whose executions alternate
during the computation. Competition may lead to exclusion of some of the competing sequences.
The determination of whether to exclude one or the other of a mutually conflicting pair of
cachable sequences rests on the tradeoff between the time to store each cache block and
the time saved by running that block from cache. If the time saved by caching the
less-profitably cached instruction sequence is less than the time to re-store the more-profitably
cached instruction sequence, then the less-profitably cached instruction sequence should be
excluded.
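The competition rule can be sketched as a comparison of two quantities. The Python example below is a rough accounting under stated assumptions: storing a block of L instructions costs L + 2 broadcast slots, quantization and the activation instruction are ignored, and the competing block is displaced on every visit so it must be stored again each time. The function names and the sample sequence lengths are hypothetical.

```python
def time_saved_per_visit(length, iterations, rho_b):
    """System cycles saved by caching one visit to a sequence, ignoring quantization.

    Assumes the block is stored (length + 2 broadcast slots) on every visit
    because a competing block displaced it in the meantime.
    """
    broadcast = iterations * length
    cached = (length + 2) + iterations * length / rho_b
    return broadcast - cached


def exclude_less_profitable(saved_by_less_profitable, restore_cost_of_more_profitable):
    """Rule from the text: exclude when the saving is below the re-store cost."""
    return saved_by_less_profitable < restore_cost_of_more_profitable


rho_b = 4
saved_p = time_saved_per_visit(length=40, iterations=16, rho_b=rho_b)  # 438.0
saved_q = time_saved_per_visit(length=10, iterations=4, rho_b=rho_b)   # 18.0
restore_p = 40 + 2                                                     # slots to store P again

print(saved_p, saved_q)
print(exclude_less_profitable(saved_q, restore_p))  # True: Q should be excluded
```

Caching the shorter sequence Q would save 18 cycles in isolation, but it displaces P, whose re-store costs 42 cycles, so under this accounting Q is better left to global broadcast.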


4.5.3  Step  3:  Determine  Where  in  Cache  to  Place  Blocks 

There is no decision to be made here for F0 or F2 caches, because these caches contain only single
cache  blocks  at  a  time.  In  the  general  case,  this  problem  is  equivalent  to  the  storage  management 
problem  solved,  for  example,  by  segmented  virtual  memory  replacement  algorithms.  Note  that  the 
correct  solution  depends  collectively  on  the  dynamic  execution  characteristics  of  a  program’s  cache 
blocks  and  their  sizes;  this  sub-problem  is  arbitrarily  difficult  for  arbitrarily  complex  programs. 


4.5.4  Step  4:  Schedule  Cache  Blocks 

With respect to scheduling the machine code instructions in a program controlling a SIMD
computation, time is measured in numbers of instruction slots. The time interval of a global broadcast
instruction is $\rho_b$ times the time interval of a cached instruction. Therefore, operation latencies as
measured in numbers of instruction slots are $\rho_b$ times higher for cached instructions than for globally
broadcast instructions.

Quantization affects the latencies of globally broadcast instructions but it does not affect the
latencies of cached instructions, as measured in numbers of instruction slots. The following equations
for the latency of an operation Y using MCS X whose stepcount is S(Y) illustrate this point:

$$\text{latency of } Y = \begin{cases} \left\lceil \dfrac{S(Y)\,\rho_x}{\rho_b} \right\rceil & \text{global broadcast instructions} \\[1ex] S(Y)\,\rho_x & \text{cached instructions} \end{cases}$$

The difference between the two latency measures reflects the fact that MCS X receives $\rho_b / \rho_x$ clock
pulses during every global instruction broadcast interval, which corresponds to one clock pulse every
$\rho_x$ cached instructions.
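The two latency measures can be compared with a few lines of Python. In this sketch the function name, the argument names, and the sample values are illustrative assumptions; rounding the broadcast latency up to a whole slot is how the sketch models quantization.

```python
import math


def latency_slots(stepcount, rho_x, rho_b, cached):
    """Latency of an MCS operation, in instruction slots.

    stepcount: number of MCS clock steps the operation needs
    rho_x:     PE clock cycles per clock cycle of the MCS in question
    rho_b:     PE clock cycles per global broadcast interval
    """
    pe_cycles = stepcount * rho_x
    if cached:
        return pe_cycles                   # one cached slot per PE clock cycle
    return math.ceil(pe_cycles / rho_b)    # whole broadcast slots only


cached = latency_slots(stepcount=3, rho_x=2, rho_b=4, cached=True)
broadcast = latency_slots(stepcount=3, rho_x=2, rho_b=4, cached=False)
print(cached, broadcast)       # 6 2
print(cached, broadcast * 4)   # 6 8: the same operation measured in PE clock cycles
```

The operation occupies more slots when scheduled in a cache block, yet the broadcast schedule wastes two PE clock cycles because the latency must be rounded up to a whole broadcast interval.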

The latency difference means that in general, the machine code instruction sequence corresponding
to a given assembly language instruction sequence occupies more cached instruction slots than
global broadcast instruction slots. The lengthening of instruction sequences for caching has a negative
impact on I-cache speedup, because the time to store a cache block is proportional to the length
of  the  cache  block.  The  cache  block  store  time  becomes  significant  if  it  is  not  amortized  over  many 
iterations  of  the  cache  block’s  execution.  Furthermore,  lengthier  cache  blocks  require  larger  cache 
memories,  increasing  the  chip-area  cost  of  a  useful  I-cache. 

In any event, the scheduling dilation of cache blocks motivates the compression of sequences of
NOOPs that is common to all the F-family caches: in a manner reminiscent of the NOOP compression
technique used to conserve instruction memory in VLIW computers [15](Sec.6.5.1), a sequence of
NOOPs added as place holders representing timing delays is represented compactly in an F-family
cache. The simple encoding scheme associates a parameter with a cached NOOP that specifies a
number of cycles for which to suspend incrementing the cache program counter before advancing
to the next instruction in the cache block. Note that the need for such a compression scheme
would be eliminated were the multi-clock generator augmented with control inputs that allow clocks
regulating idle subsystems to be stopped and restarted at arbitrary phase. These observations
suggest that although times as measured in instruction intervals are larger for cached instructions
than for globally broadcast instructions, this difference does not translate into cache blocks that are
proportionally larger than their globally broadcast counterparts.
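A run-length encoding of this kind can be sketched in a few lines of Python. The representation below (a `("NOOP", n)` tuple standing for a run of n delay slots) is an assumption made for illustration; the text does not specify the actual encoding or whether the parameter counts the first NOOP.

```python
def compress_noops(block):
    """Collapse each run of NOOPs into a single ('NOOP', run_length) entry."""
    compressed = []
    for instruction in block:
        if instruction == "NOOP" and compressed and isinstance(compressed[-1], tuple):
            # Extend the run that is already at the end of the output.
            compressed[-1] = ("NOOP", compressed[-1][1] + 1)
        elif instruction == "NOOP":
            compressed.append(("NOOP", 1))
        else:
            compressed.append(instruction)
    return compressed


scheduled = ["send", "NOOP", "NOOP", "NOOP", "add", "NOOP", "store"]
stored = compress_noops(scheduled)
print(stored)                       # ['send', ('NOOP', 3), 'add', ('NOOP', 1), 'store']
print(len(scheduled), len(stored))  # 7 5
```

The schedule still spans seven instruction slots when executed, but it occupies only five cache locations, which is the sense in which dilation does not enlarge cache blocks proportionally.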


4.5.5  Step  5:  Store  Cache  Blocks 

A  cache  block  is  stored  in  cache  before  it  can  be  activated.  The  storing  of  a  cache  block  is  accomplished 
by  globally  broadcasting  the  body  of  the  cache  block  in-between  a  pair  of  cache-control  operations 
demarcating the beginning and the end of the cache block. For F-family I-cache variants, L + 2
broadcast instructions are needed to store a block of L machine code instructions. The sequence of
L + 2 instructions used to store the cache block into the cache is called the preamble. In F-family
caches,  the  final  instruction  of  the  preamble  is  stored  directly  in  the  cache  as  a  sentinel  identifying 
the  end  of  the  cache  block.  Alternatives  would  have  been  to  store  block-bounds  information  in  a 
special  part  of  the  cache,  or  to  designate  the  bounds  on  each  activation  of  a  cache  block.  Each  of  these 
alternatives  requires  storage  and/or  control  logic  comparable  in  size  to  the  single  cache  location  it 
would  save. 

A  question  that  arises  in  relation  to  storing  the  cache  block  is  where  in  the  program  to  place 
the  preamble.  The  general  answer  to  this  question  seems  to  be  that  it  doesn’t  matter  where  in  the 
program  the  preamble  appears,  subject  to  the  following  constraints: 

•  The preamble should be executed as infrequently as possible, because doing otherwise amounts
to redundantly re-storing the block in the cache.

•  The preamble should be executed as late as possible before the cache block is executed, because
doing otherwise may cause the cache block to over-write another cache block that is still usefully
resident in cache.


4.5.6  Step  6:  Activate  Cache  Blocks 


Upon activation of a cache block, all instructions globally broadcast prior to completion of the cache
block's execution are not executed by the PEs. The number of system clock cycles for the execution
of a cache block with an F-family I-cache variant is known statically. The system controller is free
to perform useful work while a cache block is active. For example, for two-port I-cache variants, the
system controller could globally broadcast another cache block to be pre-stored through the second
cache memory port. Such pre-storing is not possible for F-family I-cache variants, as they have but
a single cache memory port.

With F0 I-cache variants, the globally broadcast instruction activating a cache block supplies no
parameters, because the length of the block is determined by an embedded delimiter, the starting
location of the block is always 0, and there is no iteration of the cache block. The number of system
clock cycles that occur during execution of an F0 cache block of length L is given in Equation 4.16:


\[ \text{duration of } F_0 \text{ cache block execution} = \frac{L + 1}{p_b} \ \text{system clock cycles} \tag{4.16} \]


The  additional  instruction  time  is  that  spent  recognizing  the  cached  sentinel  instruction  demar¬ 
cating  the  end  of  the  cache  block. 
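
As a small worked sketch of this duration, assume it is (L + 1) / p_b, where p_b is the ratio of the PE clock rate to the global instruction broadcast rate. Whether the result is rounded up to a whole number of system clock cycles is not stated here, so the sketch leaves it as an exact fraction. The function name is illustrative.

```python
from fractions import Fraction


def f0_block_cycles(length, p_b):
    """System clock cycles spent executing an F0 cache block.

    length: number of machine code instructions L in the block
    p_b:    ratio of PE clock rate to global instruction broadcast rate
    The +1 accounts for recognizing the cached sentinel instruction.
    """
    return Fraction(length + 1, p_b)


# A 9-instruction block with the PEs clocked 5 times faster than broadcast
# occupies (9 + 1) / 5 = 2 system clock cycles.
cycles = f0_block_cycles(9, 5)
```

Broadcasting the same 9 instructions globally would take 9 system clock cycles, which is where the speedup from caching comes from.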

With F2 I-cache variants, the globally broadcast instruction activating the cache block specifies
an iteration count. The number of system clock cycles that occur during execution of an F2 cache
block of length L iterated I times is given in Equation 4.17:


\[ \text{duration of } F_2 \text{ cache block execution} = \frac{\ldots}{p_b} \ \text{system clock cycles} \tag{4.17} \]
4.6 Examples of Static I-Cache Management

This section illustrates static I-cache management for F-family I-cache variants for a program
containing multiple cachable instruction sequences whose executions alternate. Consider the program
twine:


program twine;
    for i := 1 to I do
        for j := 1 to J do
            A
        end;
        for j := 1 to K do
            B
        end;
        for j := 1 to L do
            C
        end;
    end;
end twine;

Executions of instruction sequences A, B, and C alternate during the course of the computation.
All three are cached in a multi-block I-cache variant with sufficient cache size. For one-block I-cache
variants, or for multi-block variants with insufficient cache size, the three instruction sequences
compete for cache space and replace one another each time round the outer loop. In this case, the
cache block for A would be used J times before being over-written by the cache block for B, which
would be used K times before being over-written by the cache block for C, which would be used L
times before being over-written once again by the cache block for A.

Note that a simply nested loop structure is a special case of twine. For example, if I > 1, J = 1,
K > 1, and L = 0, then A represents the outer loop body and B represents the inner loop body.

Static Management of twine for an F0 I-cache Variant

There are many possible ways to manage twine for a "one-block, one-shot" I-cache variant. The best
way depends on the I-cache speedups for the individual instruction sequences. If all three sequences
yield large I-cache speedups, then the best way to use an F0 I-cache variant is shown as program
twine_F0_all, wherein each block is re-stored in cache on each iteration of the outer loop:

program twine_F0_all;
    for i := 1 to I do
        store A's cache block
        for j := 1 to J do
            activate A's cache block
        end;
        store B's cache block
        for j := 1 to K do
            activate B's cache block
        end;
        store C's cache block
        for j := 1 to L do
            activate C's cache block
        end;
    end;
end twine_F0_all;

It might be the case, however, that the greatest speedup is obtained by placing just one of the sequences
in cache and leaving it there undisturbed throughout the computation. This would be the case, for
example, if sequences A and C were subsystem-bound but sequence B were not. In this case, the
best way to use an F0 I-cache variant is shown as program twine_F0_B_best, wherein the cache block
for B is stored just once, at the outset of the computation:

program twine_F0_B_best;
    store B's cache block
    for i := 1 to I do
        for j := 1 to J do
            A
        end;
        for j := 1 to K do
            activate B's cache block
        end;
        for j := 1 to L do
            C
        end;
    end;
end twine_F0_B_best;


Static Management of twine for an F1 I-cache Variant

A "multi-block" I-cache variant, including F1, F3, and F7, is ideal for loop structures such as that
exhibited by twine. Subject to capacity limitations, a multi-block I-cache is able to contain all three
cache blocks at once, so the I-cache speedup from caching each sequence can be realized while paying
the store overhead just once per block, at the outset of the computation. The result of static cache
management for F1 is illustrated in program twine_F1:

program twine_F1;
    store A's cache block
    store B's cache block
    store C's cache block
    for i := 1 to I do
        for j := 1 to J do
            activate A's cache block
        end;
        for j := 1 to K do
            activate B's cache block
        end;
        for j := 1 to L do
            activate C's cache block
        end;
    end;
end twine_F1;
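
The difference in store overhead between re-storing every block on each outer iteration and storing each block once can be sketched numerically. The sketch assumes each store costs L + 2 broadcast instructions for a block of L instructions, as stated in Section 4.5.5. The block lengths and the outer trip count are illustrative values, not measurements.

```python
def preamble_broadcasts(block_lengths, stores_per_block):
    """Total instructions broadcast just to store blocks (L + 2 per store)."""
    return sum((length + 2) * stores_per_block for length in block_lengths)


lengths = [10, 20, 30]   # illustrative lengths of the blocks for A, B, and C
I = 100                  # illustrative outer-loop trip count

# One-block cache holding all three sequences in turn: re-stored every outer iteration.
restored_every_iteration = preamble_broadcasts(lengths, stores_per_block=I)

# Multi-block cache with enough capacity: each block stored once at the outset.
stored_once = preamble_broadcasts(lengths, stores_per_block=1)
```

With these numbers the one-block scheme broadcasts 6600 store instructions against 66 for the multi-block scheme, a factor of I.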


Static  Management  of  twine  for  an  F2  I-cache  Variant 

F2 is a "one-block, multi-shot" I-cache variant. Cache management for F2 is subject to the same
considerations as for F1. If all three sequences from twine are profitably cached, then twine_F2_all
results:


program twine_F2_all;
    for i := 1 to I do
        store A's cache block
        activate J iterations of A's cache block
        store B's cache block
        activate K iterations of B's cache block
        store C's cache block
        activate L iterations of C's cache block
    end;
end twine_F2_all;



If the instructions in the individual sequences are such that it is best only to cache sequence B,
then twine_F2_B_best results:


program twine_F2_B_best;
    store B's cache block
    for i := 1 to I do
        for j := 1 to J do
            A
        end;
        activate K iterations of B's cache block
        for j := 1 to L do
            C
        end;
    end;
end twine_F2_B_best;


Static Management of twine for an F3 I-cache Variant

F3 is similar to F1, but F3 has the ability to execute multiple iterations of a cache block from a single
activation. If cache memory is sufficiently large to hold all three blocks, then program twine_F3
results:


program twine_F3;
    store A's cache block
    store B's cache block
    store C's cache block
    for i := 1 to I do
        activate J iterations of A's cache block
        activate K iterations of B's cache block
        activate L iterations of C's cache block
    end;
end twine_F3;


Static Management of twine for an F7 I-cache Variant

F7  is  the  most  complex  member  of  the  F-family.  The  entire  program  body  may  be  stored  intact  with 
an  F7  I-cache  variant: 


program twine_F7;
    store A's cache block
    store B's cache block
    store C's cache block
    store the outer loop's cache block, which activates the others
    activate I iterations of the outer loop's cache block
end twine_F7;


Chapter  5 

I-Cache  Evaluation 


The  complex  interactions  of  I-cache  capabilities  with  logical  properties  of  programs  and  electrical 
characteristics  of  SIMD  computers  make  it  difficult  to  discern  a  priori  the  I-cache  speedup  for  a 
given SIMD computation. This analytical difficulty motivates empirical evaluation. Measurements
of  I-cache  speedup  against  varying  computation  parameters  provide  a  basis  for  evaluating  the  factors 
that  affect  I-cache  speedup. 

Designs for F0 and F2, two of the simplest members of the F-family, have been evaluated empirically
on a detailed model of SIMD computation. Speedup measurements have been performed over
a set of SIMD computer variants for a diverse collection of sample programs. The SIMD computer
variants were chosen to represent a range of existing SIMD computers. The PE datapath widths in
the SIMD computer variants range from 1 bit to 32 bits. The sample programs for which evaluations
were performed were chosen to span a broad range of properties of data-parallel problems. Together,
the SIMD computer variants and sample problems on which I-cache variants were simulated cover
a large region of the space of SIMD computations in which to explore I-cache speedups, limitations,
and costs.

Results obtained using the simulator show that the multi-clock generator, introduced to provide
the multiple high-rate clocks needed for I-cache, does not provide substantial speedups on its own
without I-cache. However, even the simplest I-cache variant (F0) yields substantial speedups. An F2
I-cache has the additional capability of iterating cache blocks. The measured results confirm that the
ability to iterate cache blocks smooths the quantization effect and yields higher overall speedups.
For the subject problems solved on a SIMD computer with wide-word PEs, the evaluations show that
these simple I-cache variants do not need to be able to store more than 50 instructions. For some
sample problems, the simple I-caches yield speedups near the highest possible, while for others the
speedups are much less than maximum. Measurements of the sensitivities of I-cache speedup to
programs' MCS-intensiveness show the surprising result that increasing MCS-intensiveness either
increases or decreases I-cache speedup, depending upon the MCSs' electrical characteristics.


5.1  A  Simulator  for  SIMD  Computations 

The throughput of a SIMD computation is measured using a detailed simulator for the basis computer
that is described in Appendix A. Programmed in machine code, the simulator represents the state
variables of a SIMD computer explicitly, and it updates the state variables on each clock phase. The
simulator is parameterized so as to be able to capture the characteristics of a broad range of generic
SIMD computers. I-cached SIMD computers with F0 or F2 I-cache variants are also simulated. To
check whether apparent speedup is due to I-cache or merely to the clocking of all subsystems at
their highest rates, the simulator also represents multi-clock SIMD computers. A multi-clock SIMD
computer has all of the elements of an I-cached SIMD computer, excluding cache memory and cache
controller in the PE chip.

In  the  basis  computer,  there  is  no  program  control  for  the  PE  FU  inside  the  PE  chip.  A  machine 
code  instruction  controls  one  PE  clock  cycle  of  FU  activity.  Sequences  of  single-cycle  machine  code 
instructions are needed to perform some assembly language operations. For example, addition of
32-bit  operands  requires  8  single-cycle  instructions  on  a  4-bit  FU. 
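
The arithmetic behind this example can be sketched as follows, assuming that a carry-chain or bit-wise operation processes one FU-width slice of the operand per PE clock cycle and that each cycle needs its own machine code instruction. The function name is illustrative, and more complex operations such as multiplication do not follow this simple rule.

```python
def single_cycle_steps(operand_bits, fu_bits):
    """Machine code instructions needed for a carry-chain or bit-wise
    operation, assuming one FU-width slice is processed per PE clock cycle."""
    return -(-operand_bits // fu_bits)  # ceiling division


# Addition of 32-bit operands on a 4-bit FU: 32 / 4 = 8 single-cycle instructions.
steps_on_4_bit_fu = single_cycle_steps(32, 4)
```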

The "Control and pin access" block within the PE chip (shown in Figure 3.2) constitutes special-purpose
local control for the MCSs. When an MCS operation requires multiple clock cycles to
complete, this block provides the control for the MCS in clock cycles intervening between a pair of
instructions initiating and terminating the multiple-cycle MCS operation. Why assume that there
is MCS control inside the PE chip when there is no program control inside the PE chip? The typical
simplicity of the control and pin access circuit, as for example that illustrated for local external
memory access in Figure A.2, facilitates its inclusion in the PE chip. By contrast, the sequencing of
primitive instructions to realize arithmetically complicated FU operations is not necessarily simple.
The PE chips of some existing SIMD computers include some FU control, while in others there is no
on-chip MCS control. The assumptions made for the basis computer, reflected in the machine code
instruction set, introduce a degree of "machine-dependence" with respect to the severity of the global
instruction broadcast rate limitation, because the FU is always instruction delivery-bound, whereas
an MCS is not necessarily so. Providing FU control inside the PE chip would lessen the apparent
instruction delivery-boundedness of programs, whereas assuming no MCS control inside the PE chip
would increase the apparent instruction delivery-boundedness of programs. The machine-dependent
assumptions made for the basis computer are a reasonable middle ground.

A  further  machine-dependent  aspect  of  the  basis  computer  is  the  genericity  of  MCSs.  The  MCS 
abstraction  (which  was  introduced  in  Section  3.4)  limits  the  faithfulness  with  which  MCSs  of  some 
SIMD  computers  are  represented.  On  the  other  hand,  this  generic  representation  allows  a  wide 
variety  of  specific  MCSs  to  be  accommodated  in  the  one  simulator. 

Despite  using  a  specific  machine  code  instruction  set,  the  simulator  is  parameterized  so  that  it 
may  reflect  closely  the  widely  varying  characteristics  of  existing  and  foreseeable  VLSI-based  SIMD 
computers. In simulating generic SIMD computations, the following characteristics are parameters:

• program loop structure,

•  program  data-dependence, 

•  problem  size, 

• inter-PE communication, as determined by the topological relationship between sub-problems
of  a  data-parallel  problem  and  the  inter-PE  communication  network, 

•  PE  memory  usage,  as  determined  by  the  allocation  of  problem  data  and  intermediate  results  to 
registers  and  to  locations  in  local  external  memory, 

•  number  of  PEs  in  the  computer, 

•  number  of  registers  per  PE, 

•  PE  datapath  width, 

•  PE  FU  circuit  complexity,  and 

•  number  of  PEs  per  PE  chip. 
Values for some of these parameters are explicit in the assembly language program used to
describe an operationally structured subject computation. Those parameter values that determine
operation stepcounts¹ are specified as inputs to the automatic translation of an assembly language
program into the machine code program that controls the simulated computation.

A p-set² provides five additional parameters used in the simulation of multi-clock SIMD
computations. A p-set reflects both the VLSI implementation technique characteristics that determine
electrical propagation characteristics of wires and the MCS topologies that determine wire lengths
and electrical loads. A p-set has the form {p_b, p_r, p_i, p_c, p_l}, wherein each value is the ratio of the
PE clock rate to the rate of a clock regulating an MCS. The values in a p-set specify the clock rates
of global instruction broadcast, response, system memory data I/O, inter-PE communication, and
local external memory, respectively. p-set values guide the translation from assembly language into
machine code.

F0- and F2-enhanced SIMD computations are also simulated. For this purpose, the appropriate
cache-control protocol instructions are available in the instruction set. Cache size, the total number
of cache memory locations, is an additional parameter used in the simulation of I-cached SIMD
computation.

Having  verified  assembly  language  program  correctness  and,  where  necessary,  having  measured 
data-dependent iteration counts through complete simulation of a computation for a given data set,
throughput  can  then  be  measured  statically  for  different  p-sets  through  simple  instruction  counting. 
Measuring  throughput  in  this  way  saves  considerable  simulation  time. 


5.2  Speedup  Measurement  Method 

The  empirical  method  used  to  measure  I-cache  speedup  is  illustrated  in  Figure  5.1 .  The  method 
comprises  the  following  steps: 

1. Evaluation begins at the upper-most, left-most shaded box in Figure 5.1. A generic SIMD
computation is represented as an assembly language program. The assembly language program
used as the starting point reflects problem parameters (including data-dependence and
the topology of sub-problem data sets) and hardware parameters (including inter-PE communication
network topology, PE count, and PE register count). The assembly language program
specifies a sequence of PE and MCS operations and their dependence relationships. The assembly
language program is thus an operationally structured description of the computation.

2.  The  subject  computation  is  then  transformed  into  physically  structured  variants  represented 
as machine code programs. Operation latencies are explicit in a physically structured computation.
The machine code program therefore reflects PE parameters including FU circuit
complexity  and  datapath  width,  and  MCS  parameters  including  PE  chip  pin  time-sharing  and 
communication  network  diameters.  A  distinct  throughput  baseline  is  established  by  simulating 
the  computation  described  by  each  distinct  set  of  parameter  values. 

3.  The  assembly  language  description  of  the  subject  computation  is  then  again  transformed,  this 
time  into  physically  structured  variants  of  multi-clock  SIMD  computation.  A  unique  variant  is 
defined by each unique p-set. Simulations yield throughput measurements that are compared
against  the  baseline  for  that  computation. 

4. The assembly language description of the subject computation is then modified to reflect
enhancement with either F0 or F2. The modification includes adding cache-control instructions
needed for static management of I-cache, as discussed in Section 4.5.


¹ As defined in Section 3.6, the stepcount of an operation is the number of clock cycles taken to perform it.
² The use of p-sets to characterize relative subsystem clock rates is introduced in Section 3.9.
Figure 5.1: Method for Measuring I-Cache Speedup
5. The resulting assembly language description of the I-cached computation is then transformed
into physically structured variants that reflect hardware parameters including cache size.
Simulations yield throughput measurements that are compared against the baseline for that
computation and against same-p-set multi-clock counterparts.

Dotted  arrows  and  ghost  boxes  in  Figure  5.1  indicate  that  a  number  of  alternative  transformations 
may  be  applied  at  a  given  step.  The  large  branching  factor  in  the  tree  of  transformations  shown  in 
Figure  5.1  makes  apparent  the  enormity  of  the  I-cached  SIMD  computer  design  space. 

The translation from assembly language into machine code has a slight impact on results. The
assembler/scheduler that produces machine code from an assembly language program attempts
to achieve the greatest degree of overlap among flow-independent instructions. The scheduling
algorithm is heuristic and imperfect, and variations in the optimizer's success inject a degree of "noise"
into throughput measurements. In general, good scheduling algorithms are difficult to write [49].
Although imperfect, the scheduling algorithm used uniformly for all translations to machine code
shown in Figure 5.1 happens not to be a very bad one. On one hand, the variability introduced by
idiosyncrasies of that scheduling algorithm adds to the realism of the simulation results. On the other
hand, the scheduler's imperfections limit the generality of the results. The results reported here,
based on detailed simulations under realistic assumptions, indicate the speedups that would likely
be obtained were a contemplated I-cached SIMD computer actually constructed and its throughput
measured relative to its generic counterpart.


5.3  Preparing  A  Subject  Computation 

The  assembly  language  program  that  is  the  starting  point  for  I-cache  evaluation  is  an  input  to 
the  method  illustrated  in  Figure  5.1.  The  assembly  language  program  describes  an  operationally 
structured generic SIMD computation. That program is prepared outside of the scope of the empirical
speedup measurement mechanism. Assembly language programs were written by hand for each of
the 8 sample problems for which results have been obtained.

In a subject computation, an explicit mapping from problem input and output data sets to PEs
has  already  been  established.  In  general,  appropriate  mappings  are  difficult  to  find,  although  there 
are  automatic  means  for  finding  good  mappings  in  some  cases  [84]. 

In  a  subject  computation,  the  manner  in  which  the  PE  registers  and  local  external  memory  are 
used  has  already  been  established.  The  register  allocation  problem  for  SIMD  PEs  is  an  instance  of 
the  corresponding  problem  in  uniprocessor  computation.  In  general,  the  best  register  allocation  is 
difficult  to  find,  although  there  are  algorithms  for  finding  good  ones  [14]. 

In an operationally structured subject computation, program branches that are conditional on
PE data have already been converted to PE context management instructions. How little of the entire
program is executed within the scope of PE context management instructions is one measure of the
appropriateness of a problem for execution on a SIMD computer. There is a systematic syntactic schema
for converting conditional constructs into PE context management instructions [27] (Sec. IV.B).
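
The general idea of replacing a data-dependent branch with context management can be sketched as follows. This is a generic illustration of masked lock-step execution, not the schema of [27]. The function name and the push, complement, and pop steps noted in the comments are assumptions made for the example.

```python
def simd_if_else(values, predicate, then_op, else_op):
    """Lock-step if/else: every PE sees both branches, and a per-PE context
    bit decides whether each broadcast instruction takes effect."""
    context = [predicate(v) for v in values]               # push context
    values = [then_op(v) if c else v for v, c in zip(values, context)]
    context = [not c for c in context]                     # complement context
    values = [else_op(v) if c else v for v, c in zip(values, context)]
    return values                                          # pop context: all PEs active again


# Each list element stands for the datum held by one PE.
result = simd_if_else([1, -2, 3, -4],
                      predicate=lambda v: v < 0,
                      then_op=lambda v: -v,       # negative values are negated
                      else_op=lambda v: v * 10)   # the rest are scaled
```

Both branches are broadcast to every PE, so time spent inside such a construct is partly wasted on inactive PEs, which is why the fraction of the program under context management is a useful measure.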

The  system  controller  both  sequences  the  program  controlling  a  SIMD  computation  and  evaluates 
loop-index-dependent  expressions  for  the  PEs.  The  use  of  the  system  controller  is  explicit  in  the 
assembly  language  program  describing  a  subject  computation. 


5.4  Four  SIMD  Computer  Variants 

The  experimental  method  sketched  in  Figure  5.1  is  used  to  evaluate  I-cache  added  to  each  of  4  SIMD 
computer  variants.  The  computers  differ  primarily  in  the  numbers  of  PEs  per  chip,  in  PE  FU  widths 
(in  bits),  and  in  VLSI  implementation  technique.  Each  is  based  on  an  existing  SIMD  computer,  with 
the exception of the last, which is an extrapolation from the VLSI implementation technique used
for a recent microprocessor.

This section describes the SIMD computer variants by giving representative stepcounts for assembly
language operations. The complete set of operations is given in Appendix A. For each SIMD
computer,  stepcounts  are  given  in  this  section  for  operations  representative  of  a  class  of  operations 
applied  to  32-bit  operands  in  an  instance  of  the  SIMD  computer  containing  1024  PEs.  For  example, 
NOR  is  representative  of  bit-wise  logical  operations,  ADD  is  representative  of  carry-chain-based 
arithmetic  operations,  and  MULT  is  representative  of  more  complex  arithmetic  operations.  The 
following  table  describes  some  representative  operations: 


Operation Name    Meaning
NOR               bit-wise NOR
ADD               addition
MULT              multiplication
LC_PUSH_EQ        context management
LITERAL           global broadcast literal
LOAD              local external memory read
LDNO              neighbor-to-neighbor inter-PE communication

Variant SIMD-A is based on Blitzen [36]. SIMD-A's PE chip contains 128 1-bit PEs, each with
1K register bits.³ Variant SIMD-B is based on MP-1 [62]. SIMD-B's PE chip contains 32 4-bit PEs,
each with 1K register bits. Variant SIMD-C is based on SLAP [27]. SIMD-C's PE chip contains 4
16-bit PEs, each with 512 register bits. Variant SIMD-D is based on the technology used in a modern
uniprocessor implementation [21]. SIMD-D's PE chip contains 2 32-bit PEs, each with 8K register
bits. Physical characteristics of the PE chips correspond to those shown in Figure 3.2. The following
table summarizes the characteristics of the 4 SIMD computer variants:


                           SIMD-A    SIMD-B    SIMD-C    SIMD-D
λ                          0.5 μm    0.8 μm    1.0 μm    0.375 μm
PEs per chip               128       32        4         2
FU bit-width               1         4         16        32
32-bit registers per PE    32        32        16        256
NOR stepcount              32        8         2         1
ADD stepcount              32        8         2         1
MULT stepcount             …         263       34        1
LC_PUSH_EQ stepcount       32        8         2         1
LOAD stepcount             128       32        4         2
LDNO stepcount             32        8         2         1

5.4.1  Sensitivity  of  Speedup  to  SIMD  Computer  Variant 

One consequence of the assumption in the simulation model that a new machine code instruction is
required for each clock cycle of an FU operation taking multiple steps is that every FU operation is
necessarily instruction delivery-bound. The effect of this assumption is most marked for PEs with
narrow-word FUs, wherein many clock cycles are needed to perform operations on 32-bit operands.
Therefore, SIMD-A (with 1-bit FUs) is the most instruction delivery-bound of the 4 SIMD computer
variants, while SIMD-D (with 32-bit FUs and single-step multiply) is the least instruction delivery-bound
of the 4 SIMD computer variants. The sensitivity to SIMD computer variant is apparent
among the I-cache speedup measurements for each problem presented in Appendix E.

³ The CM-2 is a better-known computer whose PEs are also 1-bit wide [18]. The CM-2 was not used as a basis here
because its design does not exploit VLSI implementation technique for the PEs. 3 local external memory accesses are needed
in CM-2 for each single-bit full-adder step performed by the PEs [22] (p. 20). Also, some of the arithmetic operations are
performed outside of the PE chip in CM-2.


5.5  Eight  Sample  Problems 

This section introduces the 8 sample problems for which I-cache speedups have been measured.
While the problems presented here vary over a broad range in their program characteristics, this
collection should not be construed to represent comprehensively all data-parallel problems. The 8
problems were selected from among those discussed in the literature for their simplicity and for their
diversity. For example, problems were chosen that map conveniently to various inter-PE topologies
and which appeared to possess a variety of degrees of data-dependence and subsystem-boundedness.

The  assembly  language  programs  are  included  in  Appendix  D.  Most  of  the  sample  programs 
begin  with  a  “prologue”,  wherein  the  input  data  set  is  moved  to  the  PEs  from  system  data  memory, 
continue  with  a  “kernel”  computation  within  and  amongst  the  PEs,  and  conclude  with  the  transfer 
of  the  output  data  set  from  the  PEs  to  system  data  memory. 

Each of the sample programs has been evaluated for computations using very large data sets.
The largest data sets contain about one million elements, the number of pixels in a 1K by 1K
image. The objective in choosing large data set sizes was to avoid potentially misleading constants
that typically obtain for smaller data sets. Unfortunately, for half of the problems studied, a large
data set under a datum-per-PE mapping means that there is an enormous number of PEs in the
simulated computer. For computations including some simulations of physical systems, the validity
of results grows with the size of the data set. For example, the finest tractable granularity, and
thus the largest possible data set size, is often desirable in finite-element analysis computations [66].
However, despite the apparently good reasons for building computers containing millions of PEs,
their cost is still prohibitive today.

5.5.1  Tree-Summation  (tree) 

Logarithmic-time summation is a common operation on arrays. tree is based on the algorithm
sketched in [41] (Fig. 1), wherein P PEs are arranged as a tree with P leaves. Each PE is assigned
one element of the array and summation takes O(log P) steps. In the iterated loop body, the (2i + 1)th
active PE sends its accumulated partial sum to the (2i)th active PE, the number of active PEs is halved
by de-activating the odd-indexed PEs, and the still-active PEs add the newly received value into their
partial sums. tree achieves the desired data communication using a routed inter-PE communication
network.
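
The loop body described above can be simulated sequentially as follows. This is a minimal sketch, assuming each list element stands for the partial sum held by one PE and at least one PE is present. It models only the data movement, not the routed network or the machine code of the actual tree program.

```python
def tree_sum(values):
    """Simulate the tree summation loop; returns (total, iterations)."""
    partial = list(values)
    active = list(range(len(partial)))    # indices of active PEs, in order
    iterations = 0
    while len(active) > 1:
        survivors = active[0::2]          # even-positioned active PEs stay active
        senders = active[1::2]            # odd-positioned active PEs send, then retire
        for dst, src in zip(survivors, senders):
            partial[dst] += partial[src]  # receiver adds the value it was sent
        active = survivors
        iterations += 1
    return partial[0], iterations


total, iterations = tree_sum([0, 1, 2, 3, 4, 5, 6, 7])
```

For 8 PEs the loop runs 3 times, matching the logarithmic step count.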

5.5.2  Plus-scan  (scan) 

The parallel-prefix sum s of a vector v is defined as follows:

\[ s_j = \sum_{i=0}^{j-1} v_i \]

Plus-scan (also called "tree +-scan" in [9] (p. 1535) and abbreviated "scan" herein) uses routed inter-PE
communication in calculating the parallel-prefix sum of a P-element vector using P PEs. scan
comprises log P iterations of an up-sweep followed by log P iterations of a down-sweep. Each of the
sweeps' loop bodies is structured similarly to the loop body in tree. scan's complexity is greater
than that of tree because of the need for PEs to carry forward in local memory partial sums from
the up-sweep to the down-sweep (as suggested in Fig. 13 of [9]).
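
The two sweeps can be sketched with the standard in-place formulation of the exclusive plus-scan. This is an assumption-laden illustration: it requires the vector length to be a power of two, it runs sequentially, and it overwrites partial sums in place rather than carrying them forward in local memory as the scan program is described as doing.

```python
def plus_scan(v):
    """Exclusive prefix sum via an up-sweep and a down-sweep.

    len(v) must be a power of two; each list element stands for one PE.
    """
    a = list(v)
    n = len(a)
    d = 1
    while d < n:                          # up-sweep: log2(n) iterations
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    a[n - 1] = 0                          # clear the root before sweeping down
    d = n // 2
    while d >= 1:                         # down-sweep: log2(n) iterations
        for i in range(2 * d - 1, n, 2 * d):
            left = a[i - d]
            a[i - d] = a[i]
            a[i] += left
        d //= 2
    return a


prefix = plus_scan([3, 1, 7, 0, 4, 1, 6, 3])
```

Each output element is the sum of all input elements that precede it, so the first element is always 0.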
5.5.3 Linear Array Bubble Sort (bubble)

This algorithm is a straightforward generalization of the uniprocessor bubble sort. There is no
asymptotically  faster  sorting  algorithm  on  a  linear  array. 

On each iteration of bubble’s loop body, first all pairs of PEs indexed 2i and 2i + 1 compare their
values, swapping where PE 2i + 1 contains the lower value. Then pairs of PEs indexed 2i and 2i - 1
compare their values, swapping where PE 2i contains the lower value. That the two PEs indexed 0
and P - 1 must be disabled during the second half of the loop body is slightly inconvenient. The loop
iterates until the values are ordered from least to greatest within the PEs. The number of iterations
depends  on  the  initial  permutation  of  the  input  data.  There  are  at  most  f  iterations. 
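
The two halves of the loop body can be sketched as below. This is a sequential model under the assumption that each list slot stands for one PE's value; the name linear_array_bubble_sort is hypothetical. The completion check is written as a plain scan here, whereas the SIMD program would detect completion globally.

```python
def linear_array_bubble_sort(values):
    """Sort a linear array by alternating even and odd pair exchanges.

    Returns the sorted list and the number of loop-body iterations used,
    which depends on the initial permutation of the data.
    """
    a = list(values)
    p = len(a)
    iterations = 0
    while any(a[k] > a[k + 1] for k in range(p - 1)):
        # First half: pairs (2i, 2i+1); swap where PE 2i+1 holds the lower value.
        for k in range(0, p - 1, 2):
            if a[k + 1] < a[k]:
                a[k], a[k + 1] = a[k + 1], a[k]
        # Second half: pairs (2i-1, 2i); swap where PE 2i holds the lower value.
        # PE 0 never takes part here, and PE P-1 is idle when P is even.
        for k in range(1, p - 1, 2):
            if a[k + 1] < a[k]:
                a[k], a[k + 1] = a[k + 1], a[k]
        iterations += 1
    return a, iterations
```

A fully reversed 4-element array, for example, is sorted after 2 iterations of the loop body, while already sorted data needs none.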

5.5.4  Mesh  Row-Column  Sort  (rowcol) 

This algorithm, described in [55](Lec.5,p.16), sorts P numbers on a $\sqrt{P} \times \sqrt{P}$ mesh in time $O(\sqrt{P}\log P)$.
The loop body of rowcol first sorts the rows in alternating directions and then sorts the columns
upward. The individual row and column sorts use bubble sort, and so take time $O(\sqrt{P})$, the length
of a row or column. Sorting requires at most $\log(\sqrt{P}) = O(\log P)$ iterations of the loop body, so the
asymptotic complexity of rowcol is $O(\sqrt{P}\log P)$. In rowcol, as in bubble, the iteration counts
depend  on  the  initial  permutation  of  the  data. 
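
A sequential sketch of the row-column loop follows. It assumes that the sorted result is read in snake order (even rows left to right, odd rows right to left) and that "upward" places the smallest value of each column in row 0; both are assumptions about the layout rather than statements from the text. Python's built-in sort stands in for the per-row and per-column bubble sorts, and the names snake_order and rowcol_sort are hypothetical.

```python
def snake_order(mesh):
    """Flatten a mesh row by row, reversing every odd-numbered row."""
    out = []
    for r, row in enumerate(mesh):
        out.extend(row if r % 2 == 0 else reversed(row))
    return out


def rowcol_sort(mesh):
    """Repeat (sort rows in alternating directions, sort columns) until sorted.

    Returns the sorted mesh and the number of loop-body iterations used.
    """
    n = len(mesh)
    assert all(len(row) == n for row in mesh), "mesh must be square"
    m = [list(row) for row in mesh]
    iterations = 0
    while True:
        flat = snake_order(m)
        if all(flat[k] <= flat[k + 1] for k in range(len(flat) - 1)):
            return m, iterations
        # Rows: even rows ascending, odd rows descending.
        for r in range(n):
            m[r].sort(reverse=(r % 2 == 1))
        # Columns: smallest value moves toward row 0.
        for c in range(n):
            col = sorted(m[r][c] for r in range(n))
            for r in range(n):
                m[r][c] = col[r]
        iterations += 1
```

On the 2 x 2 mesh [[4, 3], [2, 1]] the sketch finishes in 2 iterations with the mesh [[1, 2], [4, 3]], which reads 1, 2, 3, 4 in snake order.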

5.5.5  Bitonic  Sort  (bitonic) 

bitonic is an in-place merge sort that runs in $O(\log^2 P)$ time. bitonic uses a routed inter-PE
communication network to realize the communication pattern given in the description of Batcher’s
algorithm in [52](p.112).

The algorithm works as follows: an unsorted P-element array is considered initially to be P sorted
1-element sub-arrays. The inner loop body of bitonic cuts in half the number of sorted sub-arrays
while doubling the size of each; therefore log P iterations of the loop body yield the desired sorted
P-element array.

The basic operation of the bitonic sort is an in-place merge that produces a sorted N-element
array from 2 sorted N/2-element arrays. The in-place merge exploits the fact that the two initial
arrays are sorted, so that individual element-comparisons yield information about sub-ranges of the
arrays. The fascinating aspect of bitonic is that the addresses of PEs pairwise-involved in element
comparisons depend only on N, P, and the PE indices. The in-place merge comprises O(log N)
comparisons,  and  while  the  swaps  are  conditional  on  the  compared  data,  the  schedule  of  comparisons 
is  independent  of  the  values  of  the  input  data. 

In  the  bitonic  loop  body,  the  in-place  merge  is  performed  concurrently  for  all  of  the  sorted 
sub-arrays.  The  asymptotic  complexity  of  the  sort  is  therefore  given  as  the  product  of  the  number  of 
iterations of the loop body (O(log P)) and the time of each in-place merge step (ranging from O(1)
up to O(log P) and thus O(log P) on average). The asymptotic complexity is $O(\log^2 P)$.

bitonic shares the property of bubble and rowcol that data are sorted in-place without requiring
additional intermediate storage. However, of the three sorts, bitonic alone possesses the
property  that  its  runtime  does  not  depend  on  the  original  permutation  of  the  data  to  be  sorted. 
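
The data-independent comparison schedule can be illustrated with the textbook bitonic sorting network. This is a sketch under assumptions: it uses the common formulation in which a PE's partner is found by flipping one bit of its index, which may differ in detail from the bitonic program; the name bitonic_sort is hypothetical and P must be a power of two.

```python
def bitonic_sort(values):
    """Sort P values (P a power of two) with a fixed comparison schedule.

    Partner addresses depend only on the stage sizes and the PE index;
    only the swap itself is conditional on the data.
    Returns the sorted list and the number of compare-exchange steps.
    """
    a = list(values)
    p = len(a)
    assert p > 0 and p & (p - 1) == 0, "P must be a power of two"
    steps = 0
    k = 2                        # size of the sub-arrays being produced
    while k <= p:                # log P iterations of the outer loop
        j = k // 2
        while j >= 1:            # between 1 and log P steps per merge
            for i in range(p):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    out_of_order = a[i] > a[partner] if ascending else a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            steps += 1
            j //= 2
        k *= 2
    return a, steps
```

For P = 8 the schedule always takes 1 + 2 + 3 = 6 compare-exchange steps, whatever the initial permutation.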

5.5.6  Matrix  Multiply  (matmul) 

Matrix multiplication is a classic calculation-intensive data-parallel problem. matmul multiplies two
P x P matrices to yield a third P x P matrix. matmul uses a P-element linear array, wherein each
PE calculates one column of the result. matmul is discussed in detail in Appendix C.

5.5.7 Mesh Sobel Filter (sobel)

The  Sobel  filter  is  a  simple  edge-detection  operation  on  gray-scale  images  [2](p.76).  At  each  pixel  in 
the  input  image,  local  gradients  are  calculated  in  the  horizontal  and  vertical  directions,  and  the  value 
of  the  output  image  at  each  pixel  is  the  square-root  of  the  summed  squares  of  the  local  gradients. 

sobel  runs  on  a  mesh  containing  one  PE  per  pixel,  wherein  each  PE  calculates  a  weighted  sum 
of  a  neighborhood  of  pixel  values.  Repetition  of  an  instruction  sequence  occurs  only  in  the  integer 
square-root  calculation  at  the  end  of  the  program.  The  square-rooting  works  by  successive  refinement 
of  an  initial  estimate  until  the  value  converges  within  an  error  threshold;  this  calculation  iterates  a 
data-dependent number of times, until all PEs’ values have converged.
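
The data-dependent iteration of the square-root step can be sketched as follows. The refinement rule used here is Newton's method on integers, which is an assumption: the text says only that an initial estimate is successively refined until it converges. The name simd_isqrt is hypothetical. The point of the sketch is that the shared loop runs until the slowest PE has converged, so the global iteration count is the maximum over all PEs.

```python
def simd_isqrt(values):
    """Integer square roots of non-negative integers, one value per PE.

    All PEs step through the same loop; a PE that has converged simply
    sits out the remaining iterations. Returns the roots and the number
    of global loop iterations.
    """
    x = list(values)                    # initial estimate: the value itself
    done = [v < 2 for v in values]      # 0 and 1 are their own roots
    iterations = 0
    while not all(done):                # global "all converged?" test
        for pe, n in enumerate(values):
            if done[pe]:
                continue
            y = (x[pe] + n // x[pe]) // 2   # one refinement step
            if y >= x[pe]:
                done[pe] = True             # estimate stopped decreasing
            else:
                x[pe] = y
        iterations += 1
    return x, iterations
```

With the inputs [0, 1, 2, 16, 99], the roots are [0, 1, 1, 4, 9] and the loop runs 6 times, because the PE holding 99 needs 6 refinement steps even though the others finish sooner.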

5.5.8  Linear  Array  Median  Filter  (median) 

Median filtering replaces each pixel of a P x P image by the median of its local (3 x 3) neighborhood.
median  performs  the  computation  on  a  P-element  linear  array,  wherein  each  PE  calculates  a  column 
of  the  output  image,  following  the  scan-line  algorithm  presented  in  [39](Fig.5.3)'*. 

median has the most complex loop structure of the 8 sample problems, comprising 3 repeated
instruction  sequences  used  for  small  numbers  of  iterations  in  an  interleaved  manner.  The  3  repeated 
instruction  sequences  are  used  to  sort  3  pixel  values,  to  skip  the  least  value  (by  updating  pointers) 
from  among  3  sorted  lists  of  values,  and  to  select  the  output  pixel  value  as  the  least  remaining  in  the 
three  lists. 

5.5.9  Summary  of  Program  Characteristics 

Figure  5.2  summarizes  the  sample  programs’  characteristics  relevant  to  I-cache  speedup. 

5.6  p-Sets  for  I-Cache  Speedup  Bounds 

A p-set of the form {N, 1, 1, 1, 1} characterizes a SIMD computer wherein the maximum operation
rates of all MCSs other than global instruction broadcast are equal to the highest PE clock rate, while
the global instruction broadcast rate is N times lower than that. A given operationally structured
computation is least likely to be subsystem-bound when the operation rate of the subsystem in
question is high, so a p-set of the form {N, 1, 1, 1, 1} corresponds to the least subsystem-boundedness
for a given operation sequence. Therefore, speedups obtained with a p-set of the form {N, 1, 1, 1, 1}
represent upper bounds for speedup obtained with I-cache.

At the other extreme, a p-set of the form {N, N, N, N, N} characterizes a SIMD computer wherein
the highest operation rates of all MCSs are no higher than the global instruction broadcast rate,
which is N times lower than the PE clock rate. A given operationally structured computation is
most likely to be subsystem-bound when the operation rate of the subsystem in question is low, so
a p-set of the form {N, N, N, N, N} describes a SIMD computer which shows the greatest possible
subsystem-boundedness for a given operation sequence. Therefore, speedups obtained with a p-set
of the form {N, N, N, N, N} represent lower bounds for speedup obtained with I-cache.

The  sensitivity  of  I-cache  speedup  to  p-set  ratio  suggests  that  I-cache  speedup  is  a  rough  measure 
of  the  FU-to-MCS  ratio.  In  the  basis  computer  simulated  in  the  evaluations,  the  PE  chip  contains 
local control only for MCSs but not for the FU, such that FU operation sequences tend to be instruction
delivery-bound but MCS operation sequences tend not to be instruction delivery-bound. This
“machine-dependent” characteristic of the basis computer leads to the expectation that, all else being
equal,  programs  with  the  highest  FU-to-MCS  ratios  should  exhibit  the  highest  I-cache  speedups. 


^In  that  algorithm,  the  index  into  array  ad  appearing  on  the  6th  line  up  from  the  bottom  should  be  dpti ,  not  aptx. 

Figure 5.2: The sample programs have diverse characteristics. (For inter-PE communication topology,
“R”  denotes  routed  inter-PE  communications,  “L”  denotes  linear  array,  and  “M”  denotes  mesh.) 

5.7 Speedups for Multi-Clock SIMD Computers

The  local  controller  of  a  multi-clock  SIMD  computer  is  the  same  as  the  local  controller  of  an  I-cached 
SIMD computer, with the exception of the absence of the cache mechanism itself. Specifically, the
multi-clock local controller contains a multi-clock generator that regulates each subsystem at its
maximum rate. The multi-clock local controller also contains a means of adapting globally broadcast
instructions  for  single  re-broadcast  within  the  PE  chip  in-phase  with  the  PE  clock. 

A  multi-clock  SIMD  computer  should  be  somewhat  faster  than  its  generic  counterpart,  because 
MCSs are no longer necessarily rate-limited by the system clock. A multi-step MCS operation may
take  place  at  a  higher  rate  in  a  multi-clock  SIMD  computer  than  in  a  generic  SIMD  computer. 

Table  5.1  shows  the  speedup  bounds  for  a  multi-clock  SIMD  computer  for  the  various  sample 
problems. 

Table 5.1: Speedups on a Multi-Clock Variant of SIMD-D at a range of values for pb. Speedup upper
bounds are obtained with p-set {pb, 1, 1, 1, 1}, and speedup lower bounds are obtained with p-set
{pb, pb, pb, pb, pb}. These values show very modest speedups for multi-clocking alone.

Table 5.1 shows that, as a lower bound, there is no speedup for any of the sample programs. This
fact is not surprising, because the lower bound is obtained when the MCS clocks are all as slow as
the system clock. In other words, when the p-set is {pb, pb, pb, pb, pb}, the MCSs are not artificially
rate-limited by the system clock.

The multi-clock speedup upper bounds for the sample programs range between factors of 1 and
2, even at pb=16. That there is some speedup at the limiting p-set {pb, 1, 1, 1, 1} indicates that
the computation is subsystem-bound by some subsystem that becomes faster relative to the system
clock as pb increases. In most cases, there is no increase in speedup beyond pb=2. This observation
indicates  that,  after  having  made  the  bounding  MCS  twice  as  fast  relative  to  the  system  clock,  the 
computation  reverts  to  being  subsystem-bound  by  global  instruction  broadcast. 

In  comparison  to  I-cache  speedups,  the  speedups  from  multi-clock  SIMD  computers  shown  in 
Table  5.1  are  negligible  in  most  cases.  This  observation  confirms  that  most  of  the  speedup  attributed 
to  I-cache  is  in  fact  due  to  I-cache  and  not  merely  to  the  presence  of  the  multi-clock  apparatus  in  the 
PE  chip’s  local  controller. 


5.8  I-Cache  Speedup  Bounds 

A complete set of speedup measurements for each of the 8 sample problems on each of the 4 SIMD
computer variants is provided in Appendix E. For each problem, there are four graphs shown on
two consecutive pages of Appendix E, one per SIMD computer variant. Plotted on each graph, over
a range of pb=1...16, are the upper and lower bounds for the F0 and F2 I-cache speedups. The upper
bound is the speedup attained with p-set {8, 1, 1, 1, 1}, wherein the rate of the clock controlling each
MCS equals the maximum operation rate attainable within the PE chip. The speedup lower bound
is attained with p-set {8, 8, 8, 8, 8}, wherein each MCS is regulated at the rate of global instruction
broadcast. The I-cache speedup obtained with any other p-set lies between these two bounding curves.
Superimposed  on  the  measured  data  is  a  “simple-equivalent”  speedup  curve,  whose  significance  is 
discussed  in  Section  5.10. 

This section discusses I-cache speedup bounds measured on SIMD-D. The range of cache sizes
required  to  obtain  the  I-cache  speedups  is  shown  as  a  range  in  the  text  accompanying  each  graph. 
Larger  values  in  a  range  of  cache  sizes  occur  for  larger  values  of  pb. 

Figure 5.3: I-Cache Speedup Bounds for tree on SIMD-D


5.8.1  tree 

Figure 5.3 shows the measured speedup bounds for tree. The required cache size lay in the range
20...23.

The loop structure of tree is very similar to that of the program simple introduced in Section 4.4,
consisting of a short prolog followed by a single iterated loop. The difference between the F0 speedup
upper bound and the F2 speedup upper bound in Figure 5.3 illustrates the severity of the quantization
effect for non-iterating I-cache variants. There is very little difference between the lower bounds for
the two I-cache variants, indicating that the computation is subsystem-bound for p-sets of the form
{pb, pb, pb, pb, pb}.

Problems such as tree-summation are generally expected to be inter-PE-communication-bound,
because of the small amount of calculation to be performed (an addition) per inter-PE communication
operation. In part, the surprisingly significant I-cache speedups apparent in Figure 5.3 are due to the
sequence of address calculations and context management operations that are associated with the
inner loop’s communication step. Such calculation-intensive instruction sequences are made faster
with I-cache, which is why the upper bounds are high. However, when the communication operation
is  of  long  duration  and  it  overlaps  the  calculations,  speeding  up  the  calculations  does  not  decrease 
the  time  to  execute  the  loop  body,  so  there  is  little  I-cache  speedup. 

Figure 5.4: I-Cache Speedup Bounds for scan on SIMD-D


5.8.2  scan 

Figure 5.4 shows the measured speedup bounds for scan. The required cache size lay in the range
27...35.

The results for scan are very similar to those for tree. This similarity is not surprising, because
their loop structures are very much alike. Whereas tree contains a single loop, scan has two loops.
However,  the  executions  of  the  two  loops  occur  one  after  the  other,  so  the  loop  bodies  do  not  conflict 
in  cache  memory.  More  simple  loops  means  a  greater  number  of  repeat  instructions,  which  is  why 
the  I-cache  speedups  for  scan  are  slightly  higher  than  those  measured  for  tree. 

Figure 5.5: I-Cache Speedup Bounds for bubble on SIMD-D


5.8.3  bubble 

Figure 5.5 shows the measured speedup bounds for bubble. The required cache size lay in the range
28...53.

The iteration count for the inner loop of bubble is globally data-dependent and varies between
1 and N - 1. The local controller of an F2 I-cache variant lacks the ability to make data-dependent
iteration decisions. How then to use an F2 I-cache, wherein a fixed number of iterations of the cached
inner loop body are activated for a single globally broadcast cache-control instruction activating the
cache block?

The F2 management for which results are shown is to activate 10 iterations of the cached inner
loop  body  at  a  time.  The  original  program  tests  for  completion  of  the  sort  after  every  iteration.  The 
computation  is  complete  when  the  data  is  everywhere  locally  ordered.  Completion  detection  uses  a 
response  operation,  by  which  the  PEs  signal  completion  to  the  system  controller.  The  time  for  the 
response  network  to  settle  and  for  the  system  controller  to  make  a  branch  decision  is  considerable. 
The  static  I-cache  management  choice  for  bubble  for  F2  represents  an  algorithm  change,  because 
completion  tests  occur  once  every  10  iterations  instead  of  every  iteration.  This  change  means  that 
the  inner  loop  may  be  executed  up  to  9  too  many  times,  but  this  slight  inefficiency  is  compensated 
by  the  increased  rate  at  which  the  groups  of  10  iterations  execute.  The  fact  that  the  F2  speedup  is 
greater than 1 even when pb=1 suggests that the generic SIMD computation’s throughput would be
improved  by  unrolling  the  inner  loop,  perhaps  by  a  factor  of  10. 
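
The trade-off between extra iterations and fewer completion tests can be sketched as follows. This is a sequential model, not the F2 program itself: the names is_locally_ordered, loop_body, and sort_with_batched_tests are hypothetical, and each call to is_locally_ordered stands for one response operation. The model also tests once before the first batch, which is an assumption made for simplicity.

```python
def is_locally_ordered(a):
    """Model of the response operation: are the data everywhere locally ordered?"""
    return all(a[k] <= a[k + 1] for k in range(len(a) - 1))


def loop_body(a):
    """One iteration of the inner loop: even pairs, then odd pairs."""
    for start in (0, 1):
        for k in range(start, len(a) - 1, 2):
            if a[k + 1] < a[k]:
                a[k], a[k + 1] = a[k + 1], a[k]


def sort_with_batched_tests(values, batch):
    """Run `batch` iterations of the loop body between completion tests.

    Returns the sorted list, the iterations executed, and the tests made.
    """
    a = list(values)
    iterations = 0
    tests = 0
    while True:
        tests += 1
        if is_locally_ordered(a):
            return a, iterations, tests
        for _ in range(batch):
            loop_body(a)
            iterations += 1
```

For a reversed 4-element array, testing after every iteration uses 2 iterations and 3 tests, whereas testing once per 10 iterations uses 10 iterations but only 2 tests. The 8 surplus iterations are within the stated bound of 9.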

Figure 5.6: I-Cache Speedup Bounds for rowcol on SIMD-D


5.8.4  rowcol 

Figure 5.6 shows the measured speedup bounds for rowcol. The required cache size lay in the range
32...53.

As is the case for bubble, the individual row and column sorts each require a data-dependent
number of iterations. In an attempt to take advantage of F2’s iteration capability, the loop bodies
have been “unrolled” by a factor of 4 in the F2 program variant. The inferior performance of the
F2 I-cache apparent in Figure 5.6 illustrates that, if improperly managed, an F2 I-cache variant
performs less well than an F0 I-cache variant.

Figure 5.7: I-Cache Speedup Bounds for bitonic on SIMD-D


5.8.5  bitonic 

Figure 5.7 shows the measured speedup bounds for bitonic. The required cache size lay in the
range 50...53.

The number of iterations of the inner loop of bitonic varies on each iteration of the outer loop as
a function of the outer loop index and the input data set size. The basis computer’s system controller,
described in Appendix A, lacks the capability to specify dynamically calculated iteration counts for
cache block activation. A program using an F2 I-cache specifies a single, fixed number of iterations
for a cache block to be executed each time it is activated. This limitation makes it impossible to
take advantage of F2 I-cache for a program like bitonic, wherein inner loop iteration counts vary
from activation to activation. Therefore, results obtained for F2 I-cache are identical to those for F0
I-cache. For clarity, only one set of speedups is plotted in Figure 5.7.

Although clearly suffering from quantization, the speedup upper bound is large. The speedup
lower bound increases steadily as pb increases, suggesting that even with the lowest MCS clock rates,
there is some room for improvement from faster instruction delivery using I-cache.

Figure 5.8: I-Cache Speedup Bounds for matmul on SIMD-D


5.8.6  matmul 

Figure 5.8 shows the measured speedup bounds for matmul. The required cache size lay in the range
10...19.

The upper bound speedups for matmul are the best demonstration of the potential severity of the
quantization effect. The F0 speedup is identical to the F2 speedup at pb=5, but the F0 speedup is flat
after that value of pb. The difference is due solely to quantization. Presumably at some pb > 16, the
F0 speedup would jump up to meet the F2 speedup again.

The F2 speedup upper bound for this calculation-intensive data-parallel problem scales linearly
with pb.


Figure 5.9: I-Cache Speedup Bounds for sobel on SIMD-D


5.8.7  sobel 

Figure 5.9 shows the measured speedup bounds for sobel. The required cache size was 35 at all
points.

sobel comprises a weighted sum that is performed at each pixel of an image. On a mesh-connected
SIMD computer, wherein there is a unique PE for each image pixel, the sum calculation
itself is not iterated. The only repeated instruction sequences arise in the root-sum-square calculation
at the end of the program. The number of iterations of the square-root estimate-refinement loop is
data-dependent. The static management decision for F2 was to iterate individual activations of the cache
block corresponding to the square-root loop body 4 at a time. The F2 speedups plotted in Figure 5.9
show that this slight algorithm change has a large impact, even yielding a small improvement at
pb=1.

The small differences between upper and lower speedup bounds confirm that the iterated part of
the program is extremely calculation-intensive. The small total iteration count and the slow response
operation used to detect completion keep the I-cache speedups modest for sobel.

Figure 5.10: I-Cache Speedup Bounds for median on SIMD-D


5.8.8  median 

Figure 5.10 shows the measured speedup bounds for median. The required cache size lay in the
range 39...67.

The measured speedups are relatively low because of the complex loop structure of the assembly
language program. There are three repeated instruction sequences nested within the outer loop,
to sort values into local lists and to manipulate those lists to find the median. Executions of these
sequences alternate for small numbers of iterations, so they are frequently re-stored in F0 and in F2
I-caches, capable of storing but a single block at a time. median is an attractive candidate for I-cache
variants, including F1, F3, and F7, capable of storing multiple blocks at once.


5.9 Results Summaries

Section 3.8 concludes that 8 is a conservative estimate for pb in existing SIMD computers, and that
pb=8 is not unrealistically large for computers with relatively scalable board-level designs. The range
of F0 speedups for each of the sample programs at pb=8 is shown graphically in Figure 5.11, and the
corresponding range of F2 speedups is shown in Figure 5.12. In each of the summary graphs, the
speedup upper bound is obtained at p-set {8, 1, 1, 1, 1}, while the lower bound is obtained at p-set
{8, 8, 8, 8, 8}. The plotted point for each program on each graph is the I-cache speedup obtained at
p-set {8, 8, 8, 4, 2}.

Figure 5.11: Summary of F0 speedups at pb=8.


5.10 “Simple-Equivalent” Speedups

The quantization evident in the I-cache speedups means that picking values at pb=8 is potentially
misleading. Some way of smoothing the results is needed, so that it would be possible to summarize
an I-cache speedup curve using a single parameter, or perhaps two parameters. It is safer to use
the value of a curve that fits the data at pb=8 than it is to use the raw measured point, because the
curve’s value reflects information from all 16 points, not just 1.

To  this  end,  reconsider  the  program  structure  simple  introduced  in  Section  4.4: 

program simple
    B
    for j = 1 to J do
        A
    end;
end simple;

The symbols A and B in program simple denote sequences of instructions. Where the length of
sequence A is denoted A and the length of sequence B is denoted B, the time $T_g$ to run simple on a
generic SIMD computer is given as

Figure 5.12: Summary of F2 speedups at pb=8.


$$T_g = B + JA \text{ cycles}$$

The corresponding best time $T_c$ on an ideal I-cache is given as

$$T_c = B + A + \frac{JA}{p_b} \text{ cycles}$$

The speedup is the ratio of these times:

$$\text{I-cache speedup} = \frac{T_g}{T_c} = \frac{B + JA}{B + A + \frac{JA}{p_b}} \tag{5.1}$$

Dividing both top and bottom in Equation 5.1 by (A + B) yields an ugly variant of the speedup
equation:

$$\text{I-cache speedup} = \frac{1 + \frac{(J-1)A}{A+B}}{1 + \frac{JA}{(A+B)\,p_b}} \tag{5.2}$$

There is a method to this madness: Let $g = \frac{A}{A+B}$ represent the proportion of the program’s instructions
that are cachable (0 < g < 1). Then the speedup equation becomes

$$\text{I-cache speedup} = \frac{1 + gJ - g}{1 + \frac{gJ}{p_b}} = \frac{(1 + gJ)\,p_b}{p_b + gJ} - \frac{g\,p_b}{p_b + gJ} \tag{5.3}$$


Figure 5.13: Summary of I-Cache Speedup Curve Parameters, Speedups at pb=8, and Cache Size at
pb=8 on SIMD-A.

For programs for which I-cache speedup can be expected to be greatest, the product gJ, representing
the product of the fraction of cachable instructions and the cache block iteration count, should
be large. When gJ is very large, then the term $\frac{g\,p_b}{p_b + gJ}$ is nearly 0, and the I-cache speedup may be
approximated as

$$\text{I-cache speedup} \approx \frac{(1 + gJ)\,p_b}{p_b + gJ} \tag{5.4}$$

If we let C = gJ represent the product of the proportion of repeat instructions and the loop iteration
count, then we have the “simple-equivalent” I-cache speedup function:

$$\text{simple-equivalent I-cache speedup} = \frac{(1 + C)\,p_b}{p_b + C} \tag{5.5}$$

Equation  5.5  has  a  single  parameter,  C.  Fitting  Equation  5.5  to  the  measured  I-cache  speedup 
curves  yields  a  single  parameter  that  characterizes  the  curve.  Appendix  E  plots  the  measured  data 
and  superimposes  the  result  of  a  least-squares  fit  of  Equation  5.5  over  each  data  set.  The  fits  are 
excellent, except in the case of F2 used for rowcol. Note that the value of Equation 5.5 is never
less than 1. In the case of F2 used for rowcol, poor static I-cache management causes speedups to
be mostly less than 1, so the fits with Equation 5.5 are poor. For that computation, the error term
involving g cannot be ignored.
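
The speedup model and a fit of its single parameter can be sketched numerically. The sketch reads the exact speedup for simple as (B + JA) / (B + A + JA/pb) and Equation 5.5 as (1 + C) pb / (pb + C). The grid search is only a stand-in: the text does not say how the least-squares fit was computed, and all function names and the parameters c_max and step are hypothetical.

```python
def exact_speedup(a_len, b_len, j, pb):
    """Ratio of generic run time to ideal I-cache run time for `simple`."""
    return (b_len + j * a_len) / (b_len + a_len + j * a_len / pb)


def speedup_from_g(g, j, pb):
    """The same ratio written in terms of g = A / (A + B)."""
    return (1 + g * j - g) * pb / (pb + g * j)


def simple_equivalent(c, pb):
    """One-parameter approximation, with C = g * J."""
    return (1 + c) * pb / (pb + c)


def fit_c(points, c_max=200.0, step=0.05):
    """Least-squares estimate of C from (pb, speedup) pairs by grid search."""
    best_c, best_err = 0.0, float("inf")
    for k in range(int(round(c_max / step)) + 1):
        c = k * step
        err = sum((simple_equivalent(c, pb) - s) ** 2 for pb, s in points)
        if err < best_err:
            best_c, best_err = c, err
    return best_c
```

As a usage example, with A = 20, B = 5, J = 100 and pb = 8, both exact forms give 2005/275, about 7.29, while the one-parameter curve with C = gJ = 80 gives about 7.36. The difference is the neglected term involving g.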

The following set of four tables, one per SIMD computer variant, show the C resulting from the
curve fit as shown in Appendix E for both I-cache variants at both bounding p-sets, along with the
simple-equivalent I-cache speedup corresponding to that value of C. For F2 I-cache on rowcol, the
parameter g is also shown. Also shown are the cache sizes required to obtain the speedup upper
bound. The value of 7 in the last column of each table is obtained by substituting the required cache
size  for  N  in  Equation  4.9,  and  then  substituting  the  resulting  value  for  6  into  Equation  4.6,  along 
with  the  values  for  J  and  Ag  appearing  in  Figure  3.2. 

Figure 5.14: Summary of I-Cache Speedup Curve Parameters, Speedups at pb=8, and Cache Size at
pb=8 on SIMD-B.

Figure 5.15: Summary of I-Cache Speedup Curve Parameters, Speedups at pb=8, and Cache Size at
pb=8 on SIMD-C.

Figure 5.16: Summary of I-Cache Speedup Curve Parameters, Speedups at pb=8, and Cache Size at
pb=8 on SIMD-D.


5.11 Maximum I-Cache Speedup: F7 Estimates

Looking  at  the  fairly  large  speedups  shown  in  Figures  5.3  through  5.10,  one  wonders  how  the 
speedups  obtained  using  the  simple  I-cache  variants  compare  with  the  maximum  possible  speedup 
for  each  of  the  sample  problems. 

The best possible I-cache speedup would be obtained using a large, complex multi-port I-cache
variant, wherein all repeat instructions fit in cache memory, all iteration is controlled within the
PE chip, and some of the cache store time for instructions overlaps with execution of other cache
blocks. Because of the characteristic simplicity of their loop structures, the sample programs do not
present much opportunity to exploit prefetching. For the sample programs, nearly the maximum
possible I-cache speedup would be obtained using an F7 I-cache variant that has sufficient capacity
to store at one time all repeated instruction sequences. As discussed in Section 4.2, F7 is the most
complex member of the F-family of single-port I-cache variants. The program-control component
of an F7 I-cache may be as complex as the system controller's program-control component. An F7
I-cache is able to store and sequence entire programs, whose executions may involve loop nesting
and data-dependent branching. Unlike the very simple F0 and F2 I-cache variants, the F7 I-cache
variant yielding the maximum speedup may well occupy substantial chip area inside the PE chip.

Estimates  for  F7  speedups  are  obtained  through  a  static  analysis  of  assembly  language  programs 
for  the  sample  problems.  The  method  is  to  run  each  program  directly  through  a  modified  version  of  the 
assembler,  without  any  re-programming  for  I-cache.  The  modified  assembler  schedules  all  repeated 
instruction  sequences  as  if  they  were  in  cache,  assuming  an  arbitrarily  large  cache  size.  Then  the  run 
time of the resulting computation is estimated by cycle counting. A repeated instruction sequence
contributes an amount of time equal to the loop iteration count times the length of the sequence
divided by pb, plus an additional amount of time equal to the number of instructions in the sequence
which  is  required  to  store  the  sequence  in  cache.  This  method  ignores  the  re-storing  that  needs  to 
be  performed  when  cache  size  is  exceeded,  it  ignores  all  quantization  effects,  and  it  ignores  the  time 
spent  globally  broadcasting  cache-control  instructions  that  are  added  to  programs  with  I-cache. 
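
The cycle-counting rule above can be illustrated with a short sketch. It is not the modified assembler: the function names and the sample numbers are hypothetical, ρ_b is assumed to be the factor by which cached instructions issue faster than broadcast instructions, and the generic computation is assumed, for simplicity, to take one system clock cycle per broadcast instruction.

```python
def f7_time_estimate(iterations, length, rho_b):
    """Estimated time, in system clock cycles, of one repeated instruction sequence.

    iterations * length / rho_b : executing the sequence from cache
    length                      : storing the sequence in cache once
    """
    return iterations * length / rho_b + length


def generic_time(iterations, length):
    """Assumed generic time: every instruction is broadcast, one per system clock cycle."""
    return iterations * length


def f7_speedup_estimate(iterations, length, rho_b):
    """Ratio of the assumed generic time to the estimated F7 time."""
    return generic_time(iterations, length) / f7_time_estimate(iterations, length, rho_b)
```

Under these assumptions the estimated speedup of a single loop does not depend on the sequence length and approaches ρ_b only as the iteration count grows; for example, 100 iterations at ρ_b = 8 give a speedup of roughly 7.4.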

It is important to be aware of the significant difference between these F7 speedup estimates
and the speedup measurements that are the basis of I-cache evaluation. The F0 and F2 speedup
measurements are based on detailed designs, they account for quantization effects and cache-control
overheads, and most importantly, they are simulated to verify their correctness. The F7 speedups
are compiler estimates only, rather than measurements taken from the simulator.

Bearing this caveat in mind, it is interesting to compare the speedup measurements obtained
for the simple I-cache variants against estimates for the best possible F7 speedups. Figures 5.17
through 5.24 show estimates for F7 speedups at the ρ-sets that give upper and lower bounds. The
F7 speedup estimates are plotted along with F2 speedup curves, except for the programs with
data-dependent iteration counts. For those programs, the static management of F2 I-cache amounts to an
algorithm change, which makes comparison between F7 and F2 unfair. For the programs with
data-dependent branching, the F7 speedup estimates are plotted with the F0 speedup measurements
for comparison purposes.

Where iteration counts are moderate, as for tree and scan, F7 speedup estimates tail over
significantly, and F2 achieves close to this maximum speedup.

For the data-dependently branching programs, including bubble, rowcol, and sobel, the estimate
for the best F7 speedup is low. The reason is that each iteration of the inner loop includes a
response operation, whose time is typically long and not appreciably shortened with I-cache. When
a significant fraction of the computation cannot be made faster with I-cache, I-cache speedups are
low. The difference between the F0 speedups and the F7 estimates for these problems is due to
quantization and the cache-control overheads.

The three remaining problems, bitonic, matmul, and median, all have high iteration counts
and no data-dependent branching. These are ideal conditions for large I-cache speedups, and indeed,
the corresponding estimates for F7 speedup are high. For the first two of these programs, F2 achieves
roughly 75% of the F7 estimate, suffering from quantization and time spent in cache-control
instructions. For median, however, F2 speedup is well below the maximum. This disparity highlights the
principal shortcoming of an F2 I-cache variant, namely that it contains only one block at a time. The
assembly language program for median (shown in Figure D.8) contains several repeated instruction
sequences whose executions alternate during the computation. A single-block I-cache variant such
as F2 must re-store each cache block before it can be used. The small iteration counts for which
each cache block is repeated per activation in median mean that the time spent re-storing is not
amortized over a large number of iterations. median is one program that demands an F3 or higher
I-cache variant, capable of storing multiple cache blocks at once.
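
The amortization argument can be made concrete with a rough model. This is a sketch under stated assumptions rather than a measurement: each activation of a block is assumed to cost its length in system clock cycles to store plus its cached execution time, a single-block cache is assumed to re-store the block at every activation because another block has displaced it, and a multi-block cache is assumed to store it once. The function names and sample values are hypothetical and are not taken from median.

```python
def single_block_time(activations, iterations, length, rho_b):
    """Block is re-stored at every activation (one block fits in cache at a time)."""
    return activations * (length + iterations * length / rho_b)


def multi_block_time(activations, iterations, length, rho_b):
    """Block is stored once and stays resident across all activations."""
    return length + activations * iterations * length / rho_b


def generic_block_time(activations, iterations, length):
    """Assumed generic time: one system clock cycle per broadcast instruction."""
    return activations * iterations * length
```

With 100 activations of 4 iterations each, a 10-instruction block, and ρ_b = 8, the assumed generic time is 4000 cycles, the single-block model gives 1500 cycles, and the multi-block model gives 510 cycles. The fewer the iterations per activation, the more the re-storing term dominates the single-block time.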

Of  course,  the  maximum  F7  speedup  is  determined  by  program  characteristics.  The  F7  speedup 
estimates  therefore  provide  one  way  of  characterizing  programs.  For  example,  the  spread  between 
the  upper  and  lower  speedup  bounds  for  F7  apparent  in  Figures  5.17  through  5.24  provides  a  rough 
measure  of  the  MCS-intensiveness  of  each  program.  With  the  exception  of  sobel,  all  of  the  sample 
programs are MCS-intensive, as illustrated by large spreads between upper and lower F7 speedup
bound estimates. In light of this observation, it is somewhat surprising that the simple I-cache
speedup lower bounds lie in the range of 30% to 100% for these programs when ρ_b = 8.

Figure 5.17: Ideal I-Cache Compared with F2 for tree on SIMD-D

Figure 5.18: Ideal I-Cache Compared with F2 for scan on SIMD-D

Figure 5.19: Ideal I-Cache Compared with F0 for bubble on SIMD-D

Figure 5.20: Ideal I-Cache Compared with F0 for rowcol on SIMD-D

Figure 5.21: Ideal I-Cache Compared with F2 for bitonic on SIMD-D

Figure 5.22: Ideal I-Cache Compared with F2 for matmul on SIMD-D

Figure 5.23: Ideal I-Cache Compared with F0 for sobel on SIMD-D

Figure 5.24: Ideal I-Cache Compared with F2 for median on SIMD-D

(Vertical axis of each plot: Speedup Relative to Generic SIMD Computer.)

5.12 Sensitivity of I-Cache Speedup to Inter-PE Communication

How  sensitive  is  I-cache  speedup  to  inter-PE  communication  intensiveness?  The  way  to  investigate 
the  effect  of  inter-PE  communication  intensiveness  is  to  vary  the  stepcount  parameters  of  inter-PE 
communication  operations  for  each  of  the  subject  programs.  The  stepcount  of  an  operation  is  the 
number  of  subsystem  clock  cycles  taken  to  perform  the  operation. 

A reasonable expectation is that as inter-PE communication stepcounts increase, a greater
proportion of the total computation time is spent in inter-PE communication. Because MCS operations
tend not to be instruction delivery-bound in the basis computer used in I-cache evaluations, I-cache
speedup would be expected to drop off as the inter-PE communication intensiveness increases.

The actual effect of increased inter-PE communication intensiveness is surprisingly subtle.
Figures 5.25 through 5.32 show how I-cache speedups vary versus the number of clock cycles required
to perform inter-PE communication operations at a few different ρ-sets. The graphs show that
sometimes the I-cache speedup actually increases with the inter-PE communication intensiveness.

The reason for this surprising result is that the effect of increased inter-PE communication on
the I-cached SIMD computation is sensitive to the ρ-set. Increasing the inter-PE communication
stepcount by 1 increases the generic SIMD computation time by one system clock cycle for each
inter-PE communication operation performed in the computation. However, if the inter-PE
communication subsystem clock rate is some factor greater than the system clock rate, the time of
the I-cached computation increases by only the reciprocal of that factor, in system clock cycles,
per communication operation. When
inter-PE communication steps are faster than global instruction broadcasts, increasing the proportion
of inter-PE communication in a computation actually increases the I-cache speedup. For example,
when that factor is 8, as at ρ-set {8,8,1,1,1}, each inter-PE communication step occurs 8 times faster in the
I-cached SIMD computation than in the generic SIMD computation. At the other extreme in relative
rates, when inter-PE communication occurs at the same rate as global instruction broadcast, the time
added by increased inter-PE communication is roughly the same in the generic and I-cached SIMD
computations. For example, when the factor is 1, as at ρ-set {8,8,8,8,8}, each inter-PE communication step
occurs at the same rate in the I-cached SIMD computation as in the generic SIMD computation. While
flow dependencies may allow I-cache to overlap other operations with the inter-PE communication,
the tendency of increased inter-PE communication is to add the same amount of time to generic and
I-cached computations, thus reducing the I-cache speedup.
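
The dependence described above can be sketched with a simple additive model. This is an illustration under assumptions, not the simulator used for the measurements: the computation is split into inter-PE communication and "other" work, no overlap between the two is modeled, and the parameter names and sample values are hypothetical. The parameter rate_ratio stands for the factor by which the inter-PE communication subsystem clock rate exceeds the system clock rate.

```python
def icache_speedup(other_generic, other_cached, comm_ops, stepcount, rate_ratio):
    """I-cache speedup as a function of the inter-PE communication stepcount.

    other_generic -- system clock cycles of non-communication work, generic SIMD
    other_cached  -- system clock cycles of the same work with I-cache
    comm_ops      -- number of inter-PE communication operations
    stepcount     -- steps per communication operation
    rate_ratio    -- communication subsystem clock rate / system clock rate
    """
    generic = other_generic + comm_ops * stepcount              # one system cycle per step
    cached = other_cached + comm_ops * stepcount / rate_ratio   # 1/rate_ratio cycle per step
    return generic / cached
```

In this model the added steps pull the overall speedup toward rate_ratio. With other work sped up 4 times (1000 versus 250 cycles) and 100 communication operations, raising the stepcount from 1 to 10 raises the speedup when rate_ratio is 8 and lowers it when rate_ratio is 1. If the other work is already sped up by a factor equal to rate_ratio, the stepcount has no effect on the speedup.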

The graphs in Figures 5.25 through 5.32 plot I-cache speedup versus inter-PE communication
stepcount measured for machine variant SIMD-D. The graphs illustrate the ρ-set dependence of
the sensitivity of I-cache speedup to inter-PE communication intensiveness. The stepcounts for
most of the computations vary from 1 to 10, which, for example, reflects a degree of PE pin
time-sharing for neighbor communications on a regular grid. In the computations using routed inter-PE
communication, an arbitrary permutation requires O(log P) steps on P PEs using
a log-diameter communication topology such as a hypercube. The sample computations using routed
inter-PE communications all use 1 × 2^20 PEs, for a nominal inter-PE communication stepcount of 20.
The graphs for these computations show stepcounts ranging from 15 to 25.

One curve that apparently departs from the general description given above is the F2 speedup for
bubble at ρ-set {8,8,1,1,1}. That curve, shown in Figure 5.27, rises only slightly to the right, at a
value just under 8. This behavior is not surprising in light of the fact that a factor of ρ_b is the
maximum possible I-cache speedup, and that factor is already nearly attained for bubble. At ρ-set
{8,8,1,1,1}, each added inter-PE communication step goes 8 times faster in the I-cached computation.
Each added communication step has nearly the same speedup as the rest of the computation, so the
added steps barely affect the overall I-cache speedup.

The marked quantization effect for matmul is apparent in the F0 I-cache speedup at ρ-set
{8,8,1,1,1} shown in Figure 5.30. Each time the inter-PE communication stepcount increases
at ρ-set {8,8,1,1,1}, the time to execute the inner loop's cache block increases by 1 PE clock cycle.
The F0 execution time is constant as the stepcount increases from 1 to 7, and the I-cache speedup
rises because the generic SIMD computation time increases steadily over that range. However,
when the stepcount increases from 7 to 8, the F0 cache block's duration becomes just long enough
to require another system clock cycle, causing a jump in the execution time and a corresponding
marked decrease in speedup.
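
The quantization effect can be sketched as a ceiling operation. This is a minimal illustration, assuming that a cache block occupies a whole number of system clock cycles and that one system clock cycle spans ρ_b PE clock cycles; the function name and the base duration used in the example are hypothetical and are not taken from matmul.

```python
import math


def block_system_cycles(base_pe_cycles, stepcount, rho_b):
    """System clock cycles occupied by a cache block whose duration, in PE clock
    cycles, grows by one for each unit of inter-PE communication stepcount."""
    return math.ceil((base_pe_cycles + stepcount) / rho_b)
```

With a base duration of 9 PE clock cycles and ρ_b = 8, stepcounts 1 through 7 all give 2 system clock cycles, and stepcount 8 gives 3. The generic time grows with every step over the same range, so the speedup rises and then drops at the jump.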

Most of the speedup curves indicate that I-cache speedup is far more sensitive to ρ-sets than it
is to inter-PE communication stepcounts.

Figure 5.25: I-Cache Speedups v. Inter-PE Communication Stepcounts for tree

Figure 5.26: I-Cache Speedups v. Inter-PE Communication Stepcounts for scan

Figure 5.27: I-Cache Speedups v. Inter-PE Communication Stepcounts for bubble

Figure 5.28: I-Cache Speedups v. Inter-PE Communication Stepcounts for rowcol

Figure 5.31: I-Cache Speedups v. Inter-PE Communication Stepcounts for sobel

Figure 5.32: I-Cache Speedups v. Inter-PE Communication Stepcounts for median

5.13 Sensitivity of I-Cache Speedup to Local External Memory Access

How  sensitive  is  I-cache  speedup  to  local  external  memory  access  intensiveness?  This  question  is 
addressed  using  the  same  technique  used  for  inter-PE  communication  sensitivity.  By  varying  the 
stepcounts of local external memory access instructions, it is possible to vary the memory
intensiveness of a sample program. Figures 5.33 through 5.40 show the results.

Of the eight sample programs, only three use local external memory. For the rest of the programs,
the  PE  register  file  is  sufficiently  large  to  accommodate  all  of  the  program  variables.  Of  course,  the 
I-cache  speedups  of  the  programs  that  do  not  use  local  external  memory  are  flat  versus  local  external 
memory  stepcount.  These  five  include  tree,  bubble,  rowcol,  bitonic,  and  sobel. 

Of the three programs that do use local external memory, scan uses it only lightly, and its I-cache
speedup is therefore barely affected as memory stepcounts change. matmul and median make heavy
use of memory. matmul shows the effects of quantization.

As for inter-PE communication, the effect of increased local external memory stepcounts on the
I-cached SIMD computation is sensitive to the ρ-set. Increasing the local external memory access
stepcount by 1 increases the generic SIMD computation time by one system clock cycle for each
memory access performed in the computation. However, if the local external memory subsystem
clock rate is some factor greater than the system clock rate, the time of the I-cached computation
increases by only the reciprocal of that factor, in system clock cycles, per access. When memory
access steps are faster
than global instruction broadcasts, increasing the proportion of memory accesses in a computation
actually increases the I-cache speedup. For example, when that factor is 8, as at ρ-set {8,8,1,1,1}, each local
external memory access step occurs 8 times faster in the I-cached SIMD computation than in the
generic SIMD computation. At the other extreme in relative rates, when memory access occurs at
the same rate as global instruction broadcast, the time added by increasing the number of memory
access steps is roughly the same in the generic and I-cached SIMD computations. For example, when
the factor is 1, as at ρ-set {8,8,8,8,8}, each memory access step occurs at the same rate in the I-cached SIMD
computation as in the generic SIMD computation. While flow dependencies may allow I-cache to
overlap other operations with the memory access, the tendency of increased memory usage is to add
the same amount of time to generic and I-cached computations, thus reducing the I-cache speedup.
These effects are most pronounced in the I-cache speedups of median shown in Figure 5.40.

Chapter 6


Providing  Chip  Area  for  I-Cache 


I-cache  increases  the  chip  area  occupied  by  a  PE  chip’s  local  controller,  due  to  the  addition  of  a 
multi-clock generator, cache memory, and a cache controller. The I-cache evaluation discussed in
Chapter 5 presupposes that the PE chip expands as required to accommodate the local controller with
I-cache.  Increasing  chip  area  is  not  always  economically  feasible,  because  as  the  physical  dimensions 
of  a  chip  increase,  production  yield  decreases  [80].  Decreased  yield  means  that  unit  production  cost 
increases,  so  increasing  the  size  of  the  PE  chip  increases  its  cost.  A  limited  implementation  budget 
therefore  imposes  a  limit  on  the  physical  size  of  the  PE  chip. 

In  a  PE  chip  of  fixed  chip  area,  cache  size  is  limited.  Furthermore,  adding  I-cache  may  force  the  PE 
chip payload to be reduced, through reducing PE count, PE register count, and/or FU complexity. The
payload  should  be  reduced  in  a  way  that  maximizes  the  overall  speedup,  so  the  best  way  to  provide 
chip  area  for  I-cache  depends  on  the  resource  utilization  characteristics  of  a  given  computation. 

This  chapter  enumerates  a  variety  of  ways  to  accommodate  I-cache  in  a  PE  chip  of  fixed  physical 
size and examines the consequences of each. The chip-area use tradeoffs that are apparent in this
chapter  arise  also  in  the  design  of  PE  chips  for  generic  SIMD  computers.  Here,  the  question  of  how 
best  to  use  the  available  chip  area  is  re-opened  in  light  of  the  new  requirement  to  make  room  for 
I-cache. 


6.1  Strategies  for  Providing  Chip  Area  for  I-Cache 

Figure  3.11  shows  a  floor  plan  for  a  PE  chip  that  matches  the  floor  plans  of  the  PE  chips  in  existing 
SIMD  computers.  In  a  PE  chip  of  fixed  area  and  fixed  VLSI  implementation  technique,  I-cache  is 
constrained  in  the  chip  area  it  occupies,  and  adding  I-cache  may  force  removing  other  components 
from the PE chip. Using the floor plan in Figure 3.11 as a starting point, Figure 6.1 illustrates a
collection of strategies for providing chip area for I-cache. The various strategies, shown in Figure 6.1
as modifications to the floor plan of Figure 3.11, are as follows:

•  Strategy  0  is  to  provide  chip  area  for  I-cache  by  using  available  interior  area  that  has  not 
already  been  used  for  PEs.  Strategy  0  presupposes  that  in  the  generic  SIMD  computer’s  PE 
chip, substantial areas of the chip interior are unused. Under Strategy 0, that otherwise
unused chip area is occupied by I-cache. As do all of the other strategies, Strategy 0 introduces
a  restriction  on  the  number  of  instructions  that  can  fit  in  cache  at  once.  Strategy  0  is  obviously 
the  best  way  to  provide  chip  area  for  I-cache,  because  it  does  not  reduce  the  PE  chip  payload. 
Unfortunately,  if  there  is  not  much  “spare”  interior  area  in  the  PE  chip  to  start  with,  then 
Strategy  0  may  not  allow  enough  chip  area  for  the  resulting  cache  to  be  as  large  as  required  to 
realize  the  maximum  speedup  for  a  given  computation. 

• Strategy 1 is to reduce the number of PEs in the PE chip. For a fixed PE chip count, Strategy
1 reduces the number of PEs in the computer. For a fixed problem size, applying Strategy
1 requires re-structuring the computation to reflect reduced PE count: remaining PEs are
assigned sub-problems that would have been assigned to the displaced PEs.

•  Strategy  2  is  to  reduce  PE  register  count.  For  some  computations,  the  operational  structure 
changes to reflect a reduced number of registers per PE. Some sub-problem data allocated to
PE registers in the generic SIMD computation are re-assigned to locations in local external
memory under Strategy 2.

•  Strategy  3  is  to  reduce  PE  FU  complexity.  As  a  result  of  decreased  FU  complexity,  some  FU 
operations  take  an  increased  number  of  clock  cycles  to  complete. 

6.2  Method  for  Measuring  Speedups  at  Fixed  Chip  Area 

The  I-cache  evaluation  method  illustrated  in  Figure  5.1  applies  to  computations  wherein  the  chip 
area  of  the  PE  chip  is  permitted  to  increase  to  accommodate  local  controller  expansion  with  I-cache. 
Figure  6.2  illustrates  the  method  adapted  in  the  following  ways  to  compensate  for  the  local  controller’s 
expansion  within  a  PE  chip  of  fixed  chip  area: 

• Before being transformed into a physically structured multi-clock SIMD computation, the
assembly language program describing the subject generic SIMD computation is first transformed
to reflect reduced PE count or reduced PE register count.

• In addition to the re-organization and addition of cache-control instructions needed to use
I-cache, the subject computation's assembly language description is also transformed to reflect
reduced PE count or reduced PE register count.

•  In  transforming  operationally  structured  computations  into  their  physically  structured  variants 
with  I-cache,  operation  stepcount  parameters  may  increase  to  reflect  reduced  FU  complexity, 
or  they  may  decrease  to  reflect  reduced  PE  chip  pin  time-sharing. 

Figure 6.2: Method Adapted for Measuring Speedup at Fixed Chip Area

6.3 Speedups Using Strategy 0

The Strategy 0 case studies show the effects of limited cache size on F0 speedup. The following
8 graphs (Figures 6.3 through 6.10) show how the F0 speedup of the 8 sample programs varies with
cache size. Curves are plotted for each of the ρ-sets {8,8,1,1,1}, {8,8,8,4,2}, and {8,8,8,8,8}. The
speedups decrease to the right, as the chip area available for I-cache decreases, thus limiting cache
size increasingly severely.

The graphs illustrate the small cache sizes, in instructions, needed to obtain substantial F0
speedups for the sample programs on SIMD computer variant SIMD-D. The graphs show that
allowing half the maximum cache size required for a given program yields better than half the best
speedup for that program at any ρ-set. The graphs also show that the rate at which speedup drops
off as cache size is reduced is greatest for small cache memories.

The non-monotonicities of the speedups plotted in Figures 6.3 through 6.10 are due to
interactions of limited cache size with the optimizations performed in the assembler/scheduler. The
assembler/scheduler breaks each cache block into two pieces, one sequence of instructions that fits into
cache and a remaining sequence that does not. Splitting cache blocks in this way requires undoing
some of the re-ordering performed in earlier passes for the purpose of overlapping long-duration
MCS operations. A better compiler would certainly handle limited cache size more gracefully, as there
is no good reason that having more instructions in cache should slow down a SIMD computation.

The graphs show that for a given cache size, speedups are lower at ρ-sets {8,8,8,4,2} and
{8,8,8,8,8} than at ρ-set {8,8,1,1,1}. This phenomenon is due to the conservative cachability test
applied in the assembler/scheduler. Under that cachability test, an instruction is deemed not to fit
in the cache if its latency (in PE clock cycles) exceeds the number of available locations in the cache.
Also, once an instruction is excluded, all subsequent instructions in the cache block are excluded as well.
This cachability test for partial cache blocks is too conservative, as the I-cache variants evaluated
here allow timing delays to be represented compactly in cache memory.
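
The cachability test described above can be sketched as follows. This is an illustration, not the assembler/scheduler's code: the function name is hypothetical, and the sketch assumes that each admitted instruction is charged a number of cache locations equal to its latency, which is one reading of why the test is conservative.

```python
def split_cache_block(latencies, cache_locations):
    """Split a cache block into (cached_prefix, uncached_remainder).

    latencies       -- per-instruction latency in PE clock cycles, in program order
    cache_locations -- number of locations available in the cache
    """
    free = cache_locations
    for index, latency in enumerate(latencies):
        if latency > free:
            # Conservative rule: this instruction and all later ones are excluded.
            return latencies[:index], latencies[index:]
        free -= latency  # assumption: an admitted instruction is charged its latency
    return latencies[:], []
```

For a block with latencies [1, 1, 4, 1] and 4 available locations, the split is ([1, 1], [4, 1]): the final one-cycle instruction is excluded only because it follows an excluded instruction. Longer latencies, counted in PE clock cycles, shrink the cached prefix under this rule, which is consistent with the lower speedups reported at the ρ-sets with slower subsystems.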

The F0 speedup for the Sobel filter, plotted in Figure 6.9, appears to be independent of the ρ-set used.
This  similarity  in  the  speedup  curves  for  different  p-sets  occurs  because  the  part  of  the  program  for 
which  I-cache  is  used  is  calculation-intensive,  and  thus  instruction  delivery-bound  on  the  simulated 
model  of  SIMD  computation. 

Figure 6.5: F0 Speedup versus Cache Size for bubble

Figure 6.6: F0 Speedup versus Cache Size for rowcol

Figure 6.7: F0 Speedup versus Cache Size for bitonic

Figure 6.8: F0 Speedup versus Cache Size for matmul

Figure 6.9: F0 Speedup versus Cache Size for sobel

Figure 6.10: F0 Speedup versus Cache Size for median

6.4 Speedups Using Strategy 1

Strategy 1 is to displace PEs from the PE chip. Strategy 1 has a variety of effects on both the
operational  structure  and  the  physical  structure  of  a  computation,  depending  upon  the  context  in 
which  it  is  applied. 

One surprising effect of Strategy 1 is that by reducing the number of PEs time-sharing MCS
pins in the PE chip, Strategy 1 tends to reduce stepcounts for MCS operations, including local
external memory access and inter-PE communication. All else being equal, this effect of Strategy 1
tends to increase throughput. Unfortunately, not all else is equal, because the computation must be
restructured to compensate for the PEs displaced under Strategy 1.

The  number  of  PEs  P  in  a  generic  SIMD  computer  is  given  as 

P=KM 


where  K  is  the  number  of  PEs  per  PE  chip  and  M  is  the  number  of  PE  chips  in  the  computer. 

When I-cache is added to a SIMD computer using Strategy 1, the number of PEs per PE chip is
reduced to a new number K', such that K' < K. The number of PEs P' in the I-cached SIMD computer
is either less than, equal to, or greater than the original number of PEs P. The consequences of these
alternative scenarios are enumerated below:

1. P' < P

This is the case where the number of PE chips cannot be increased. The constraint M'=M
applies, for example, where the total physical size of the computer is limited. Under this
constraint, it becomes necessary to re-program K'M PEs to do the work of the original P PEs.
This reprogramming corresponds to changing the operationally structured description of the
computation.

The ratio of the number of PEs in the model on which a program is written to the number of PEs
in the computer is commonly referred to as the VP-ratio [18](p.9). Under Strategy 1 I-caching
with fixed PE chip count,

VP-ratio = P / (K'M) = K / K'


The virtual processors programming abstraction [18](Sec.3.2) facilitates writing programs
assuming one PE per data element without regard to the (perhaps lower) number of PEs in a
computer. A program written using the virtual processing abstraction is translated into an
operationally structured description of the computation in which the available PEs share the
workload evenly [41](p.1171).

Strategy 1 increases loop iteration counts by a factor of ⌈K/K'⌉. While increasing computation
time, increasing loop iteration counts actually tends to increase I-cache speedup.

Large VP-ratios place a high demand on PE memory [84](p.106), because the amount of memory
required by the PE increases linearly with its workload, which is proportional to the VP-ratio.
Another effect of Strategy 1, which occurs when the required size of PE memory increases to a
point where it exceeds the size of the PE register file, is to cause some PE data that was located
in registers in the original computation to be re-located to local external memory. Data in local
external memory typically is accessed at a lower rate than data in registers, so throughput and
I-cache speedup decrease under this effect. This effect obtains if each PE in the original generic
SIMD computation has available less than ⌈K/K'⌉ times the number of registers it requires.

By reducing the number of PEs, Strategy 1 reduces the amount of inter-PE communication,
replacing those communications with references to local memory. The impact of this effect
on I-cache speedup depends on the relative operation rates of the local external memory and
inter-PE communication subsystems.


2. P' = P

Some computational problems require a specific number of physical PEs. If K' divides P, then
it is possible to increase the number of PE chips to

M' = KM / K'

such that K'M' = KM.

In this case, adding I-cache does not change the operational structure of the computation. The
same number of PEs perform the same amount of work as in the computation without I-cache.

The physical structure of the computation does change in this case, to reflect decreased
time-sharing of PE chip pins. Also, ρ-set values may increase as chips are added, to reflect increases
in MCS inter-chip wire lengths and electrical loads.

3.  P'  >  P 

If the computational problem requires at least a certain number of physical PEs and K' does not
divide P, then the I-cached SIMD computer is forced to contain K'M' > P PEs. The extra PEs
are redundant. Consider shifting data around a linear ring of PEs: additional shift steps are
required to move data past redundant PEs in the ring. A further subtle effect in this case is that the
operational structure of the computation must change to prevent redundant PEs from altering
data used by the needed PEs. The redundant PEs are inconvenient. Strategy 1 I-caching in this
case introduces unnecessary inter-PE communication operations, which tend to limit I-cache
speedup.
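
The three scenarios enumerated above can be summarized in a small calculation. This is a sketch for illustration only: the function names and the sample chip parameters are hypothetical, and the register test simply restates the ⌈K/K'⌉ condition given for the case P' < P.

```python
import math


def strategy1_fixed_chip_count(K, K_new, M):
    """Case P' < P: the chip count stays at M, so fewer PEs share the same work."""
    P_new = K_new * M
    vp_ratio = (K * M) / P_new            # equals K / K_new
    loop_factor = math.ceil(K / K_new)    # growth in loop iteration counts
    return P_new, vp_ratio, loop_factor


def strategy1_added_chips(K, K_new, M):
    """Cases P' = P and P' > P: add chips until at least P PEs are present."""
    P = K * M
    M_new = (P + K_new - 1) // K_new      # integer ceiling of P / K_new
    P_new = K_new * M_new
    return M_new, P_new, P_new - P        # last value: number of redundant PEs


def registers_spill(registers_available, registers_required, K, K_new):
    """True if, with the chip count fixed, some register data moves to local external memory."""
    return registers_available < math.ceil(K / K_new) * registers_required
```

For example, with K = 32 PEs per chip and M = 64 chips (P = 2048), reducing to K' = 16 with the chip count fixed leaves 1024 PEs at a VP-ratio of 2. Allowing more chips instead gives M' = 128 with no redundant PEs when K' = 16, but M' = 86 with 16 redundant PEs when K' = 24, because 24 does not divide 2048.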

There are a large number of ways to apply Strategy 1, and there are many interacting effects of
reducing the number of PEs. Considering the Strategy 1 effects raises the question, what is the ideal
number of PEs to have in a PE chip? There are many plausible answers to such a question, and no
absolute answer. If the number of PE chips may increase, then Strategy 1 may have little impact on
the structure of the computation. On the other hand, if there are a limited number of PE chips, then
Strategy 1 leads to decreased throughput.


6.5  Speedups  Using  Strategy  2 


Strategy 2 is to reduce the number of PE registers to make room for I-cache. If the removed registers
are  unused  in  a  given  computation,  then  Strategy  2  makes  room  for  I-cache  at  no  cost.  On  the  other 
hand,  throughput  suffers  for  computations  that  do  use  the  available  registers. 

Figures 6.11, 6.12, 6.13, and 6.14 show the typical effect on I-cache speedup of varying the number
of  PE  registers  for  a  program  that  uses  all  of  the  registers  that  are  available.  The  graphs  show  the 
progressively  worsening  I-cache  speedup  reduction  as  increasing  numbers  of  registers  are  displaced 
from  the  PE  chip. 

The subject computation for which Strategy 2 I-caching results are shown is a variant of matmul.
The subject program uses the number of available PE registers as a parameter; extra registers are
used as a register buffer for matrix-column data otherwise stored in local external memory.


Figure 6.11: Effect of Strategy 2 for matmul on SIMD-A measured at p-set {8, 8, 8, 4, 2}.

The graphs show that the register count reduction bites quickly. Reducing the number of registers
by a factor of 2, from 1024 to 512, decreases speedup by more than a factor of 2. By contrast, reducing
the number of registers by a factor of 8, from 128 to 16, reduces speedup by less than 10%.

These results suggest that Strategy 2 has a drastic negative effect on I-cache speedup. For the
subject computation, displacing more than half of the PE registers for I-cache defeats most of the
I-cache speedup. Given the heavy use made of registers by calculation-intensive programs, Strategy
2 is not a good way of providing chip area for I-cache.


Figure 6.12: Effect of Strategy 2 for matmul on SIMD-B measured at p-set {8, 8, 8, 4, 2}.


Figure 6.13: Effect of Strategy 2 for matmul on SIMD-C measured at p-set {8, 8, 8, 4, 2}.

Figure 6.14: Effect of Strategy 2 for matmul on SIMD-D measured at p-set {8, 8, 8, 4, 2}.


6.6 Speedups Using Strategy 3

Strategy  3  is  to  remove  some  of  the  FU  circuits  to  make  room  for  I-cache.  Removing  circuits  simplifies 
the  FU,  so  that  some  arithmetic  operations  take  more  clock  cycles  to  perform.  This  effect  is  reflected 
as increased FU operation stepcount in the simulation model. Varying FU complexity does not
change  operationally  structured  computation  descriptions. 

Figures  6.15  through  6.22  show  the  effects  of  reduced  FU  complexity  on  speedup  and  on  required 
cache  size  for  each  of  the  8  sample  problems  on  SIMD  computer  variant  SIMD-D  measured  at  p-set 
{8, 8, 8, 4, 2}. 

The upper graph on each page plots F0 and F2 I-cache speedup curves. The lower graph on each
page plots the corresponding cache size required for each of the two I-cached SIMD computations.

The  graphs  show  that  speedup  lessens  progressively  as  the  FU  becomes  progressively  simpler. 
However, the slope of the speedup reduction is less than that apparent for Strategy 2. The reason
for this gradual speedup decrease is that reduced FU complexity has the effect of lengthening
instruction sequences controlling FU operations, and such sequences are instruction delivery-bound
in  the  simulation  model.  Because  I-cache  speeds  up  repeated  instruction  delivery-bound  instruction 
sequences,  the  I-cache  lessens  the  impact  of  FU  simplification.  However,  the  fact  that  the  speedups 
do  decrease  as  FU  stepcounts  increase  indicates  that  I-cache  does  not  win  back  all  of  the  calculation 
speed  that  was  sacrificed  to  make  room  for  I-cache  using  Strategy  3. 

Note that the required cache size grows nearly linearly with reduced FU complexity for all 8
problems. The required cache size growth is due to the machine-dependent assumption that a new
instruction is required on each clock cycle of a multi-step FU operation. On one hand, if the FU
part of the instruction set were similar to the MCS parts of the instruction set, then the cache size
would not grow as quickly as shown in the graphs. On the other hand, hardware used inside the
PE chip to control the FU is redundant and would occupy chip area that could otherwise be used for
the calculation components of PEs. Furthermore, for simple FU operations whose stepcounts are 1,
inside-the-chip FU control is not needed.
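
The near-linear growth follows directly from the one-instruction-per-cycle assumption, as the following sketch suggests. It is an illustrative model only: the function names, the example stepcounts, and the count of non-FU instructions are hypothetical and are not drawn from the sample programs.

```python
def cached_sequence_length(fu_stepcounts, other_instructions=0):
    """Instructions that must be cached for one repeated sequence.

    Assumes, as in the text, that an FU operation with stepcount s needs
    s instructions (one per clock cycle). other_instructions counts the
    remaining instructions, which are taken to be unaffected.
    """
    return sum(fu_stepcounts) + other_instructions


def simplify_fu(fu_stepcounts, factor):
    """Model a simpler FU by multiplying every stepcount by factor."""
    return [s * factor for s in fu_stepcounts]


# Hypothetical loop body: two 1-step operations, one 4-step operation,
# and 3 instructions that do not control the FU.
body = [1, 1, 4]
for factor in (1, 2, 4):
    print(factor, cached_sequence_length(simplify_fu(body, factor), 3))
```

Each doubling of the stepcounts doubles the FU share of the sequence, so the required cache size rises in proportion once FU instructions dominate.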

The rapid growth in cache size is perhaps the greatest drawback of Strategy 3 I-caching. FU
simplification to make room for I-cache has the counter-productive characteristic of increasing the
number of instructions that must be stored in cache to yield the maximum speedup. Given that chip
area is precious within the PE chip, this effect could render Strategy 3 infeasible.

Figure 6.15: Speedup and Cache Size for tree on SIMD-D v. FU Complexity at p-set {8, 8, 8, 4, 2}

Figure 6.16: Speedup and Cache Size for scan on SIMD-D v. FU Complexity at p-set {8, 8, 8, 4, 2}

Figure 6.17: Speedup and Cache Size for bubble on SIMD-D v. FU Complexity at p-set {8, 8, 8, 4, 2}

Figure 6.18: Speedup and Cache Size for rowcol on SIMD-D v. FU Complexity at p-set {8, 8, 8, 4, 2}

Figure 6.19: Speedup and Cache Size for bitonic on SIMD-D v. FU Complexity at p-set {8, 8, 8, 4, 2}

Figure 6.20: Speedup and Cache Size for matmul on SIMD-D v. FU Complexity at p-set {8, 8, 8, 4, 2}

Figure 6.21: Speedup and Cache Size for sobel on SIMD-D v. FU Complexity at p-set {8, 8, 8, 4, 2}

Figure 6.22: Speedup and Cache Size for median on SIMD-D v. FU Complexity at p-set {8, 8, 8, 4, 2}


6.7 Which Strategy Is Best?

There  is  no  one  Strategy  for  providing  chip  area  for  I-cache  that  is  universally  superior.  The  best 
Strategy  or  combination  of  Strategies  depends  on  the  requirements  of  the  problem,  the  structure  of 
the  original  computation,  and  the  resources  available  in  the  PE  chip. 

Strategy 0, to use spare area for I-cache in the PE chip, has no impact on the PE chip’s payload.
Strategy  0  would  thus  seem  ideal.  Unfortunately,  the  high  stakes  in  SIMD  computer  architecture 
for  making  good  use  of  chip  area  lead  to  PE  chips  that  typically  have  very  little  area  left  spare. 

I-cache  competes  with  PEs  for  chip  area.  Limited  chip  area  introduces  a  limit  to  cache  size, 
which  in  turn  limits  the  number  of  repeat  instructions  that  can  be  stored  for  subsequent  retrieval  at 
the  high  on-chip  rate.  Limited  cache  size  also  increases  the  quantization  effect  on  I-cache  speedup. 
A  cache  block  can  be  iterated  in  cache  only  if  it  is  wholly  stored  there.  Some  iterated  instruction 
sequences  are  too  long  to  fit  entirely  in  a  cache  of  limited  cache  size.  Being  able  to  store  only  a  part 
of  an  iterated  instruction  sequence  means  that  it  cannot  be  iterated  in  cache.  When  the  cache  size 
is  too  small,  the  cache  block  iteration  capability  of  I-cache  variants  including  F2  cannot  be  exploited, 
so quantization effects become apparent as they do for non-iterating I-cache variants including F0.
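
The all-or-nothing character of cache block iteration can be sketched with a toy delivery-time model. The model below is an assumption made for illustration and is much simpler than the simulation model: every broadcast instruction is charged one system clock interval, every instruction replayed from I-cache is charged one PE clock interval, and a partly stored sequence gains nothing. All names and values are hypothetical.

```python
def loop_delivery_time(body_len, iterations, cache_size, t_sys, t_pe):
    """Instruction delivery time for a loop under a toy model.

    body_len   -- instructions in the iterated sequence
    iterations -- number of times the sequence executes (>= 1)
    cache_size -- instructions the I-cache can hold
    t_sys      -- cost of one broadcast instruction (system clock interval)
    t_pe       -- cost of one instruction replayed from cache (PE clock interval)
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    if body_len <= cache_size:
        # Broadcast and store the body once, then iterate it in cache.
        return body_len * t_sys + (iterations - 1) * body_len * t_pe
    # The body is not wholly stored, so every iteration is broadcast.
    return iterations * body_len * t_sys


# Hypothetical values: a 64-instruction body iterated 100 times,
# with the system clock interval 8 times the PE clock interval.
print(loop_delivery_time(64, 100, 64, 8, 1))  # body fits exactly
print(loop_delivery_time(64, 100, 63, 8, 1))  # cache one entry too small
```

In this sketch a cache that is a single entry too small forfeits the whole benefit of iteration, which is the step-like behavior the text calls a quantization effect.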

Strategy 1, to displace PEs from the PE chip to make room for I-cache, increases the amount of
work that is performed by the remaining PEs. Increasing the per-PE workload tends to increase
iteration counts, which acts to increase the benefit of I-cache. However, a number of programming
problems arise as a consequence of Strategy 1. For example, problem data may no longer fit into
PE registers, thereby increasing the reliance on local external memory. Increased local memory
usage tends to decrease I-cache speedup through increasing the local memory-boundedness of a
computation. Strategy 1 has a surprising throughput-increasing effect: reducing the number of PEs
in the PE chip reduces time-sharing of PE chip pins. Reduced pin time-sharing decreases the time
taken to perform MCS operations, which tends to decrease subsystem-boundedness of computations.
Less time spent waiting for MCS operations to complete means greater throughput, which in turn
increases the apparent speedup due to I-cache.
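
The pin time-sharing effect can be illustrated with a rough count of pin cycles. This sketch assumes, purely for illustration, that every PE on a chip moves the same number of bits through a shared set of data pins and that the pins carry one bit each per cycle; the function name and the numbers are hypothetical.

```python
def pin_transfer_cycles(pes_per_chip, bits_per_pe, pins):
    """Pin cycles needed for all PEs on a chip to move bits_per_pe bits
    each through `pins` time-shared data pins (integer ceiling)."""
    if pins <= 0:
        raise ValueError("pins must be positive")
    total_bits = pes_per_chip * bits_per_pe
    return -(-total_bits // pins)


# Hypothetical chip with 32 data pins and 32-bit operands.
print(pin_transfer_cycles(64, 32, 32))  # before Strategy 1
print(pin_transfer_cycles(32, 32, 32))  # after displacing half of the PEs
```

Under these assumptions, halving the number of PEs per chip halves the pin cycles per MCS operation, which is the direction of the effect described above.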

Strategy 2, to remove PE registers to make room for I-cache, may have no effect or it may have an
enormous effect. If a computation uses only a small number of registers, then the registers removed
to make room for I-cache were redundant and are removed at no cost. However, a computation that
uses all of the available PE registers makes greater use of local external memory under Strategy
2. Increased local memory-intensiveness means that the computation is more likely to be
memory-bound, thus exhibiting reduced I-cache speedup.

Strategy 3, to remove some PE FU circuits to make room for I-cache, increases the number of
PE clock cycles needed to perform some arithmetic operations. Strategy 3 makes computations more
likely to be calculation-bound. Although I-cache is most useful for calculation-bound computations
on the simulation model of SIMD computation, Strategy 3 nevertheless degrades I-cache speedup.
Strategy 3 has the additional negative consequence of drastically increasing the cache size needed
to obtain maximum I-cache speedup by increasing the lengths of machine code instruction sequences
used to control arithmetic operations.

In summary, the objectives of SIMD computer architecture lead to PE chips that are simplified
to the point allowed by the calculation requirements of the problem mix. Unless the PE is overly
complicated for a given problem, for example containing too many registers or unused single-cycle FU
operations, further simplification of the PE as per Strategies 1, 2, or 3 is bound to reduce throughput.

On the other hand, the estimates of Section 4.3 indicate that I-cache with small cache size occupies
negligible chip area in modern chips. Furthermore, the results of Chapter 5 show that simple I-cache
variants, while small, indeed provide substantial speedups for the sample problems. Therefore,
it is unlikely that a large proportion of the PE chip payload would need to be displaced to make
room for I-cache that is useful for simple problems. However, there are problems that demand more
complicated I-cache variants or larger caches. If the local controller with I-cache were to become
very large, for example, displacing all but 1 of the PEs in the chip, then the resulting I-cached SIMD
computer would become a tightly synchronized MIMD computer.


Chapter 7

Conclusion 


The addition of I-cache to SIMD computers makes them faster. The speedup depends on program
properties,  PE  architecture,  chip  characteristics,  and  the  electrical  characteristics  of  multi-chip 
subsystems.  Detailed  simulations  of  SIMD  computations  for  a  diverse  collection  of  sample  problems 
on  a  variety  of  hardware  configurations  show  substantial  speedups,  even  for  simple  I-cache  variants. 

Ideally, an I-cache is large enough to store entire iterated sequences, thereby attaining the
speedup possible from controlling iterations within the PE chip. Also, the larger the cache memory,
the less pronounced the thrashing due to conflict misses. Unfortunately, I-cache cannot be arbitrarily
large because it occupies chip area that could otherwise have been used for PEs. The simulations
show that small I-caches are useful for the subject problems. On this basis, it is reasonable to expect
even small I-cache to be somewhat useful for larger problems. Detailed estimates indicate that
I-cache containing 1000 32-bit instructions would occupy less than 5% of the chip area of a modern
PE chip.

The appropriate complexity of the I-cache design clearly depends on program structure. For
a simple loop with data-independent iteration count, a simple, statically managed I-cache variant
exploits all of the potential speedup. A program containing multiple alternating repeated instruction
sequences demands a slightly more complex I-cache variant. When inner loop iteration counts depend
on outer-loop iteration index values, it is desirable to have a yet more complex I-cache variant that
calculates iteration counts on its own. Finally, exploiting the maximum I-cache speedup for a
program with arbitrary, data-dependent control flow may require a dynamically managed I-cache
whose program-control complexity approaches that of the system controller itself. The best I-cache
is determined by the requirements of the computation.

Predicting  the  I-cache  speedup  for  an  arbitrary  high-level  language  program  is  difficult,  due  to  the 
complex  interactions  among  PE  chip  characteristics,  system  characteristics,  I-cache  capabilities,  and 
program  properties.  Calculation-intensive  loop  bodies  iterated  large  numbers  of  times  make  for  large 
I-cache  speedups.  While  programs  that  use  MCSs  intensively  tend  to  exhibit  lower  I-cache  speedups 
than  those  that  don’t,  I-cache  should  yield  some  speedup.  For  example,  the  sample  problems  include 
some that are ordinarily considered to be inter-PE communication-bound. And yet the simulations
show considerable I-cache speedups for all problems. This phenomenon occurs because a high-level
programming language’s “communication” operation is realized using machine code instructions that
perform  address  calculations  or  context  management  operations.  I-cache  makes  such  calculations 
faster. 

This analysis of I-cached SIMD computer architecture emphasizes the throughput performance
metric and the chip-area hardware cost metric. For scalable data-parallel computations, wherein
throughput is proportional to the number of PEs that are brought to bear, minimizing the chip area
per PE maximizes the number of PEs in a given total chip area, thus maximizing throughput, all
else being equal. Unfortunately, not all else is equal, because to drastically reduce the chip area of
a PE by removing its program control is to introduce a limitation in the rate at which instructions
are supplied. The measured I-cache speedups confirm the principle that to make the best use of chip
area, a multiprocessor’s PE chips should contain the right amount of redundantly replicated program
control: too much and there are fewer PEs than there could have been, too little and the PEs execute
instructions too slowly.

This principle implies that a compromise between the MIMD and SIMD architectural extremes
is valuable for data-parallel programs. In general, the electrical propagation characteristics of the
technology used to fabricate the computer dictate the ideal way of distributing program control
throughout the computer. The program control provided within each layer of the integration
hierarchy (including chips, multi-chip modules, printed-circuit boards, racks, and so forth) should be
simple so that the greatest proportion of resources are used for PE calculations and yet sufficiently
powerful to be able to provide control within its own layer of the abstraction hierarchy at the highest
rate attainable therein. In this light, the generally accepted taxonomic distinction between MIMD
computers and SIMD computers as “equal” design alternatives appears to be misleading. Rather,
SIMD computer architecture is the specialization of MIMD computers as appropriate for specific
data-parallel computations.


7.1  Future  Directions 

This analysis introduces the issues pertaining to I-cache design and explores how properties of
programs,  systems,  and  chips  interact  in  determining  I-cache  speedup.  SIMD  instruction  cache  is 
a  new  idea,  and  an  I-cached  SIMD  computer  has  yet  to  be  built.  In  establishing  that  even  simple 
I-cache  variants  yield  significant  speedups  over  a  collection  of  sample  programs,  large  portions  of 
an  apparently  vast  design  space  have  been  left  unexplored.  Only  a  relatively  small  number  of  the 
many  alternative  physically  structured  variants  of  a  subject  computation  have  been  measured,  only  a 
relatively  small  number  of  the  many  alternative  transformations  of  the  subject  computation  to  reflect 
I-cache  have  been  measured,  and  only  a  relatively  small  number  of  the  many  alternative  physically 
structured  variants  of  the  I-cached  computation  have  been  measured.  This  section  enumerates  some 
of  the  areas  in  which  to  extend  and  improve  the  analysis. 

7.1.1  Problem  Characteristics 

A  natural  and  important  extension  is  to  analyze  computations  that  are  larger  in  scope  than  the  sample 
problems  used  here.  For  example,  programs  whose  loop  structures  are  not  statically  analyzable  due 
to  data-dependence  would  provide  a  basis  for  evaluating  dynamic  cache  management  techniques. 

Extending the study to more complicated problems would be facilitated by a high-level
data-parallel language. High-level language programming introduces an array of compiler issues. It
might prove valuable to quantitatively assess the dependence of I-cache speedup on the mapping
between problem input data and PEs, or the interactions between I-cache speedup and compiler
optimizations including register allocation and scheduling.

7.1.2  System  Characteristics 

For programs with complex loop structures, it may not be possible for a compiler to determine
which blocks are best placed in I-cache and when to store them there. A system controller component
to perform these functions dynamically presents an interesting design challenge.

The simulation model used in the evaluation contains a single PE FU and a single instance of
each type of MCS. Relaxing this assumption, for example so that the PE could contain multiple FUs
perhaps operating at different rates, would allow the model to represent a wider range of SIMD
computers.

The simulation model used in the evaluation incorporates synchronous elements throughout
the computer. A multi-clock generator provides the clocks within the PE chip that coordinate the
variously timed subsystems. The multi-clock generator design used in the basis computer is fairly
clumsy. A good multi-clock generator is an interesting design problem.

7.1.3  I-Cache  Characteristics 

Although  the  simplest  I-cache  variants  that  have  been  studied  in  detail  yield  considerable  speedups 
for  the  sample  problems,  more  complicated  programs  require  more  complicated  I-cache  variants. 

In addition to examining the properties of members of the F-family other than F0 and F2, it would
also be interesting to evaluate I-caches with multiple ports. With multiple ports, one cache block can
be stored while another is active, a form of prefetching that reduces the time spent waiting for cache
blocks to be stored. Multi-port I-caching introduces the complexities of concurrent accesses to cache
memory. The management problem here includes handling partial block stores, as will occur when
a cache block finishes executing before another has been completely stored through the second port.

The identical I-cache speedups measured for F0 and for F2 on the bitonic sorting problem
show that there is no advantage from F2 in that case. The reason is that the iteration counts of
the inner loop in bitonic depend on the value of the outer loop index. One way to overcome this
limitation is to make it possible to use the system controller’s index register subsystem to evaluate
loop-index-dependent expressions to use in specifying iteration counts in F2 cache block activations.
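
The dependence of the inner iteration count on the outer index can be seen in a textbook bitonic sorting network. The sketch below is illustrative only: it assumes the standard formulation in which outer stage s runs s compare-exchange sub-steps, and the loop structure of the bitonic sample program may differ in detail. The function name is hypothetical.

```python
def bitonic_inner_counts(n):
    """Inner-loop iteration count for each outer-loop stage of a textbook
    bitonic sorting network on n keys, where n is a power of two."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    stages = n.bit_length() - 1  # log2(n) outer-loop stages
    # Stage s performs s compare-exchange sub-steps.
    return list(range(1, stages + 1))


print(bitonic_inner_counts(16))
```

Because the count differs on every pass of the outer loop, an iteration count fixed as a constant in the cache block activation cannot describe the inner loop; it has to be computed from the loop index.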

The low I-cache speedups shown for F2 on the rowcol sorting problem arise from the inflexibility
of iteration in F2. The number of iterations of the inner loop in rowcol depends on the initial
permutation of the data to be sorted, and this number of iterations differs each time around the
outer loop. The iteration count for an F2 cache block is specified when the block is activated. If a
local counterpart of the response network were incorporated in the PE chip, then the local controller
could be made to sense the completion condition for its chip’s complement of PEs.

The possibility of the local controller sensing “global” data-dependent conditions among its group
of PEs is one example of how conditional-control-flow programs would exploit I-cache. Execution in
this case resembles pseudo-MIMD computation [10], wherein each PE chip operates independently
as a SIMD computer in its own right during parts of a computation.

If  the  sequencing  performed  by  the  cache  controller  is  able  to  depend  on  PE  data  conditions,  then 
it  becomes  possible  to  incorporate  data  caching  in  SIMD  computers.  With  D-cache,  the  time  for  a 
PE’s  local  external  memory  access  varies  according  to  that  PE’s  addressing  pattern.  If  the  cache 
controller  is  able  to  detect  cache  hits,  it  may  sequence  cache  blocks  appropriately  for  its  complement 
of  PEs.  Inter-PE  communication  requires  re-synchronization  of  the  PE  chips,  which  limits  the  benefit 
of D-caching. A characterization of the benefits of D-cache and the circumstances under which those
benefits  are  realized  might  be  valuable. 

7.1.4  Evaluation  Mechanism 

Figure 5.1 illustrates the method used to evaluate I-cache speedup for sample problems. The
translation step from assembly language programs to machine code programs uses timing information
about operations to schedule machine code instructions. The scheduling algorithm attempts to
overlap the execution of mutually flow-independent MCS and FU instructions where possible by
reordering the instructions. As is apparent in Figures 6.3 through 6.10 showing the effects of limited
cache size on speedup, the scheduler’s attempt at optimization interacts in surprising ways with the
cache size. Basic-block reordering takes place prior to the assignment of instructions to instruction
memory locations. When an instruction sequence turns out not to fit in cache memory, the reordering
is undone in a conservative manner. The effect of the scheduler’s simple algorithm for handling
limited cache size shows up as the occasional non-monotonicity of I-cache speedup versus cache size
apparent in Figures 6.3 through 6.10.

There is also room for improvement in the clocking assumptions. Each subsystem in the
simulation model is regulated by a clock with a (potentially) unique rate. The multi-clock generator
generates these multiple clocks at the requisite rates, subject to the following restrictions:

1. The PE clock, regulating the PEs and the local controller, is the fastest clock in the computer.

2.  The  system  clock  is  the  slowest  clock  in  the  computer. 

3.  All  clocks  are  free-running. 

4.  All  clocks  are  phase-locked  to  the  system  clock. 

5.  All  clock  rates  are  integer  multiples  of  the  system  clock  rate. 

6.  All  clock  rates  are  integer  sub-multiples  of  the  PE  clock  rate. 

Each  subsystem  clock’s  interval  can  be  no  less  than  the  duration  of  the  subsystem’s  longest 
operation  step.  However,  the  restrictions  listed  above  force  the  rate  of  a  clock  to  be  lower  than 
necessary  in  some  cases.  Relaxing  the  arbitrary  restrictions  would  allow  a  more  complete  exploration 
of I-cache speedup sensitivities to p-set values.
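
The way the restrictions slow a clock can be sketched as follows. The sketch expresses clock intervals in units of the PE clock interval and assumes that restrictions 5 and 6 together mean a legal interval must be a whole number of PE cycles that divides the number of PE cycles per system cycle. The function and parameter names are hypothetical.

```python
def allowed_interval(min_interval, pe_per_sys):
    """Smallest legal subsystem clock interval, in PE clock cycles.

    min_interval -- duration of the subsystem's longest operation step,
                    in PE clock intervals (may be fractional)
    pe_per_sys   -- PE clock cycles per system clock cycle (an integer)
    """
    for n in range(1, pe_per_sys + 1):
        # n must divide pe_per_sys so that the clock rate is an integer
        # multiple of the system rate and a sub-multiple of the PE rate.
        if pe_per_sys % n == 0 and n >= min_interval:
            return n
    # Restriction 2: no clock may be slower than the system clock.
    raise ValueError("operation step is longer than the system clock interval")


# Hypothetical case: 8 PE cycles per system cycle and a subsystem whose
# longest operation step takes 3 PE clock intervals.
print(allowed_interval(3, 8))
```

In this hypothetical case the subsystem could run with an interval of 3 PE cycles, but the nearest legal interval is 4, so its clock is slower than its operation steps require.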

The restriction that the clocks are free-running appears to be an unfortunate design mistake on
my part. For example, if the clocks were instead re-startable on an arbitrary cycle of the PE clock,
then an otherwise-idle MCS could begin an operation on the earliest possible PE instruction. A
free-running MCS clock introduces unnecessary delays in starting some MCS operations, because
the instruction specifying the operation’s commencement cannot be applied until the MCS clock and
the PE clock are in-phase. The two clocks are in-phase only once every MCS clock cycle.
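
The start-up delay can be quantified with a small sketch. It assumes, for illustration, that the MCS clock period is a whole number of PE cycles and that the two clocks are in phase on PE cycles 0, ratio, 2*ratio, and so on; the function name is hypothetical.

```python
def start_delay(pe_cycle, ratio):
    """PE cycles an idle MCS must wait before starting an operation that
    becomes ready on PE cycle pe_cycle, given a free-running MCS clock
    whose period is `ratio` PE cycles."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (-pe_cycle) % ratio


# With an MCS clock period of 4 PE cycles, the wait depends on when the
# operation becomes ready within the MCS cycle.
print([start_delay(t, 4) for t in range(8)])
```

Averaged over all arrival cycles the wait is (ratio - 1) / 2 PE cycles, whereas a clock that could be restarted on any PE cycle would incur none.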

It would be interesting to extend the timing model to include asynchronous implementations
of components. In principle, an asynchronous system need never operate at lower than the inherent
maximum rate, subject to flow-dependencies. The impact of asynchrony would be particularly
profound where operation step durations for a given subsystem vary widely.


7.2  How  Important  is  SIMD  Instruction  Cache,  Really? 

For real problems and for realistic assumptions regarding the electrical properties of a high-PE-count
SIMD computer, I-cache yields speedups of 30% to more than 700% over generic SIMD computation.
From the high-level language programmer’s point of view, the effect of I-cache is similar to that of
increasing chip clock rates by many times, with concomitant speedups up to limits imposed by the
requirements for inter-chip communication inherent in a given program.

As VLSI implementation technique continues to improve, the time to drive a lumped capacitance
from the gate of a minimum inverter across a chip increases. It becomes less reasonable to view
a chip as a single fast circuit domain and more reasonable to view a chip itself as a collection of
fast circuit domains among which communication is slow or expensive. At some point in this scaling
path, the time required to distribute instructions locally within the ever-larger PE chip itself becomes
throughput-limiting. At that point, a single local controller in the PE chip can no longer provide
instructions to the PEs at the maximum rate of PE operation. To counter this limitation, it will
eventually become advantageous to re-apply I-caching within the PE chip itself, replicating enough
local control within the chip to keep the PEs supplied with instructions at the maximum attainable
operation rate.

The speedups possible from I-cache are limited to constant factors of perhaps 2 to 7 or so, far less
than the asymptotic improvements available for some problems using parallelism. At first glance,
I-cache appears to facilitate flexible encodings that increase the amount of control information provided
to the PEs through an inherently slow channel. Another way to think of I-cache is as a means to
exploit the throughput benefits of using large numbers of parts in parallel, while retaining the
operation-rate advantages inherent in keeping the parts themselves small.

The sample problems were chosen for their expected dissimilarity. For example, sorting and
tree-reduction are commonly thought to be inter-PE communication-bound, and thus might not be
expected to yield much I-cache speedup, unlike matrix multiply, which is known to be
calculation-intensive. For all of the sample problems, the benefits of I-cache more than compensate for the
chip-area cost of enhancement. The consistency of the results across the range of problems points
strongly to the conclusion that throughput is significantly higher in I-cached SIMD computers than
in their generic counterparts, even for problems commonly thought to be subsystem-bound. The
I-cache speedups for complete, practical applications could possibly be greater or less than those
characterized here, depending on specific communication and calculation requirements relative to
the FU and MCS characteristics of the underlying SIMD computer. Aside from characteristically
simple loop structures, the sample programs are not extraordinary in their operation mixes.

The  measured  results  reflect  the  assumption  in  the  simulation  model  that  there  is  no  local 
control  in  the  PE  chip  for  the  FU,  although  there  is  local  control  in  the  PE  chip  for  the  MCSs.  A 
PE  chip  might  contain  local  control  for  the  FU,  thus  shortening  instruction  sequences  and  lessening 
the  apparent  I-cache  speedup.  Alternatively,  a  PE  chip  might  contain  no  local  control  for  MCSs, 
thus lengthening instruction sequences and increasing the payoff from I-cache. In any case, the
measurements  presented  for  SIMD-D,  whose  PEs  perform  32-bit  multiply  in  one  clock  cycle,  are  not 
affected  by  the  lengthening  of  FU  instruction  sequences  arising  from  the  assumptions  regarding  local 
control. 

Clearly, the coverage of this work is not exhaustive, and there exist many further avenues of
research that follow from it. The results presented here have consequences for the analysis of
problems solved by SIMD computers, and for languages and compilers used in describing those
solutions. The measurement method provides a means for studying in detail, with respect to specific
problems, the interactions among system controller, PE, MCS, and I-cache designs. These research
avenues become compelling in light of the high stakes for providing instructions to PEs at the highest
rates.


7.3  Final  Comments 

The analysis has been performed for VLSI-based computers. However, the reader may see that these
results should apply with equal validity given any computer implementation technology wherein
information is represented as energy that is spatially distributed in three dimensions. The simple
underlying principle exploited in I-cached SIMD computer architecture is, “the more energy to be
re-distributed per computation step, the larger the radius over which it is re-distributed, the slower
or more expensive the computation.” I-cache makes it possible to control large numbers of PEs that
are packed as densely as possible within chips without having to re-broadcast repeated sequences of
instructions through a relatively slow channel.

I-cache speedup is bounded above by pb, and pb may not be much higher than 8 in a practical computer.
This number is far less than the largest number of useful PEs in a multiprocessor. Therefore, the
possible gain from adding I-cache to SIMD computers is not nearly so compelling as the possible
gain from parallelism itself. This point is underscored by the observation that I-cached SIMD
computers are generally useful only for data-parallel problems, a subset of the set of problems for
which parallelism is advantageous.

In a generic SIMD PE chip, K times more chip area is allocated to FU, context manager, and
data registers than in its same-technology MIMD counterpart. And yet, a generic SIMD computer’s
PEs operate at a rate pb times lower than that of the corresponding MIMD computer. For a scalable
data-parallel computation, the following relationship obtains, given a limited implementation budget
with respect to total chip area:


generic SIMD computer throughput < (K / pb) * MIMD computer throughput (7.1)

The “jury has been out” with respect to the relative merits of SIMD computers, and in fact it
has recently been trickling in with a negative verdict. The reason might be that when K ≈ pb,
generic SIMD computer throughput is roughly equivalent to that of the MIMD counterpart, and
SIMD computers are inherently more difficult to program. The main reason that SIMD computers
have been attractive to some manufacturers would be that it is possible to produce a given-PE-count
SIMD computer using much lower total chip area than for a MIMD counterpart.

I-cache overcomes the instruction delivery limitation that contributes the factor-of-pb denominator
in Equation 7.1. If the PE chip is allowed to expand slightly to accommodate I-cache sufficiently large
to obtain a speedup of nearly pb for a given computation, then the following relationship obtains:


I-cached  SIMD  computer  throughput  <  K  *  MIMD  computer  throughput  (7.2) 

This  comparison  on  the  basis  of  throughput  and  chip  area  notably  neglects  factors  such  as  market 
size  that  impact  monetary  cost  of  chip  design  and  fabrication.  Even  if  I-cached  SIMD  computers 
exhibit  the  highest  throughput-to-area  ratios,  they  are  not  necessarily  preferable,  even  for  scalable 
data-parallel  problems.  The  SIMD  PE  chip  is  a  low-volume  part,  whereas  MIMD  computers  are  often 
made using PEs that are microprocessors fabricated in medium-to-high volumes. Economies of scale
lead to a unit cost for the SIMD PE chip which is a factor of L times higher than that of the MIMD
PE chip. Under the simple assumption that computer cost scales linearly with L, the
throughput-per-cost ratio of the SIMD computer is no more than K/L times that of the MIMD computer.
If L is as large as or larger than K, then SIMD computer architecture can be justified only for
applications for which making the best use of total chip area is paramount. It is for these
area-critical applications that I-cached SIMD computer architecture is a compelling choice.
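
The bounds in Equations 7.1 and 7.2 and the cost argument above can be checked numerically. The following Python sketch is only an illustration: the function names and the sample values of K, pb, and L are assumptions chosen for the example, not figures from this work.

```python
def throughput_bounds(k, p_b, mimd_throughput=1.0):
    """Upper bounds on SIMD throughput for equal total chip area.

    k   -- the area factor K from the text
    p_b -- ratio of the PE clock rate to the global instruction broadcast rate
    Returns (generic SIMD bound, I-cached SIMD bound).
    """
    generic = (k / p_b) * mimd_throughput  # Equation 7.1
    icached = k * mimd_throughput          # Equation 7.2, speedup close to p_b
    return generic, icached


def throughput_per_cost_bound(k, l, mimd_throughput_per_cost=1.0):
    """Bound on I-cached SIMD throughput per cost when the SIMD PE chip
    costs l times more per unit than the MIMD PE chip."""
    return (k / l) * mimd_throughput_per_cost


# Hypothetical values, for illustration only.
generic, icached = throughput_bounds(k=8, p_b=8)
print(generic, icached)                      # 1.0 8.0
print(throughput_per_cost_bound(k=8, l=16))  # 0.5
```

With K equal to pb the generic bound matches the MIMD throughput, and with L larger than K the per-cost bound falls below 1, which is the situation in which only area-critical applications justify the SIMD design.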


Appendix  A 

The  Basis  Computer 


Rather than having been evaluated for a single specific SIMD computer, I-cache variants have been
evaluated for a range of SIMD computers. A parameterized SIMD computer was designed and used
as a basis for the evaluations. This appendix describes the machine code programming of the basis
computer and highlights the “machine-dependent” aspects of its design that affect I-cache evaluation.

Given the large number and variety of SIMD computers that have been published since the
Solomon computer was described in 1962 [74], such generality would seem intractable. Fortunately,
the task is simplified by the observation that only a relatively small number of VLSI-based SIMD
computers have been reported to date, including Vaster [87], CAAPP [81, 82], SLAP [26], Blitzen [37],
and MP-1 [33, 8]. In a VLSI-based SIMD computer, the PE is an integrated circuit capable of
performing calculations within the confines of a PE chip. This definition of a VLSI-based SIMD
computer requires that the PE’s FU component be packaged in a chip along with some amount of
register memory. This restriction rules out candidates such as CM-2, whose bit-serial PEs require 3
off-chip memory references to perform a single full adder step [18].

The simulated computer's generality causes some of the details to be more intricate than would
be required in an actual design. The generalization is less than perfect, and the design reflects some
assumptions about specific subsystems. While the details of actual and foreseeable SIMD computers
vary over a large space, the design of the basis computer provides a general understanding of the
mechanisms that are involved in I-cache.

A.1  PE 

The PE contains an FU, register memory, a context manager, and interfaces to the MCSs. The
PE components are inter-connected by busses. An actual PE might depart from this basic design,
for example by having multiple specialized function units or by having point-to-point internal
inter-connections rather than busses. The machine-code language programmer’s view of the PE is
sketched in Figure A.1, showing the busses inter-connecting the register file, the FU, and the MCS
interface registers.

An important characteristic of the generic computer used in the experiments is that local control
is not provided in the PE chip for the FU, whereas local control is provided for the MCSs. This aspect
of the design of an ostensibly generic computer is justified by the observation that the local control of
an MCS is potentially simple, whereas the local control of the FU is potentially complex. For example,
compare the local external memory subsystem local control illustrated in Figure A.2 against one that
generates the sequence of control signals for multiplying 32-bit floating-point numbers on a 4-bit
PE. This assertion regarding PE chip local control appears liberal from the point of view of the
goals of the experiment, because it causes FU calculations to appear instruction delivery-rate bound.
However, recall that the SIMD idea is to remove as much redundant local control from the PE chip
as practicable. In the sense that redundant local control is provided in the PE chip so that the MCSs
tend not to be instruction delivery-rate limited, this assertion is actually conservative.

Figure A.1: PE Architectural Components. Dotted lines indicate MCSs to which the PE is connected
through interface registers.


A.2  Local  External  Memory  Subsystem 

The PEs within a PE chip share a single access port to local external memory. While this assumption
is reasonable for wide-word PEs and for PEs that provide their own addresses to memory, it does not
accurately model local external memory access in SIMD computers whose PEs cannot provide their
own addresses to memory. (Use of a single memory address for all PEs, as in CM-2 [22], reduces
the PE chip pin cost of local external memory.) The assumption that the PEs generate local external
memory addresses restricts the applicability of the resulting measurements to systems whose PEs
have that capability.

Figure  A.2  shows  how  the  local  external  memory  subsystem  is  organized.  The  small  control 
circuit  inside  the  PE  chip  illustrates  the  potential  simplicity  of  MCS  control. 


A.3  System  Controller 

The primary components of the system controller are a microcoded sequencer and a mechanism for
evaluating loop-index-dependent expressions. The system controller is intended to provide the basic
control functions required using a small set of single-cycle operations. An actual system
controller may be optimized for specific computations being performed, as for example is the case in
SLAP [28]. The potentially crucial topic of SIMD system controller design is beyond the scope of the
thesis.

The system controller is shown in Figure A.3. The system controller is partitioned into subsystems
that perform the functions enumerated in the following list:

1. Generate the system clock (system clock generator),

2. Store the program controlling the computation (instruction memory),

3. Sequence the control program (sequencer),

4. Evaluate loop-index-dependent expressions and inject the calculated values into PE chip
machine code instructions as required (indexer),

5. Gather components from successive machine code instructions to form a broadcast instruction
that controls a single cycle of PE activity (stager).

[Figure A.2 image labels: PE chip component of local external memory subsystem; LEM_CLK]

Figure A.3: System Controller


A.4  Machine  Code  Programming  Language 

A machine code program for the basis computer is a table of instructions. Each row of the table has an
index (increasing from 0) and contains a machine code instruction. Figure C.4 contains an example
of a machine code program. The machine code instruction fields are shown in Figure A.4. Each field
of a machine code instruction takes a numeric value, or equivalently the mnemonic representation
of a numeric value. A field whose value is not specified takes a standard “null” value. As indicated
in Figure A.4, each machine code instruction has two parts: a system controller part and a PE part.


System Controller Instruction:   SC_Operation | f1 | f2 | f3
(proto-) PE Instruction:         C | Dest | Operation | SrcA | SrcB | Literal_Base_Value

Figure A.4: Machine Code Instruction Word Components

A.4.1 System Controller Instructions

There are two classes of system controller instruction, as selected by the SC_Operation field: sequencer
instructions and indexer instructions. These two types of instructions differ as follows:

• Sequencer instructions resemble those of a typical microprogram controller (such as, for
example, the 49C410 [45]). Sequencer instructions may specify conditional branches or subroutine
calls and returns. The standard “null” sequencer instruction specifies that the next instruction
is the one following the present instruction in linear sequence in the program. An indexer
instruction implies the standard “null” sequencer instruction.

• Indexer instructions provide for initializing, copying, and incrementing members of a set of
index registers. A value produced by an indexer instruction is added to the PE instruction’s
Literal_Base_Value to form a literal for broadcast to the PEs. The standard “null” indexer
instruction does not alter any index registers and produces the value 0. A sequencer instruction
implies the standard “null” indexer instruction.

The system controller part of the stored machine code instruction specifies an operation of either
the sequencer or the indexer. These operations are horizontally microcoded. The encodings of the
four fields of the system controller part of the stored instruction are shown in Figure A.5. When the
SC_Operation field specifies an indexer operation, the implied sequencer operation is CONT; when
the SC_Operation field specifies a sequencer operation, no indexer operation is performed.


SC_Operation    f1                  f2               f3

Sequencer operations
CJSR            Condition Select    Branch Target    Initial Iteration Count
LTST            Condition Select    Branch Target
CBR             Condition Select    Branch Target
CONT
HALT

Indexer operations
LDX             Write Address       Index Literal
SPX             Write Address       Index Literal
CPX             Write Address       Read Address

Figure A.5: System Controller Instruction Word Components

There are a small number of sequencer operations, any one of which can be specified in the
SC_Operation field of the system controller instruction word shown in Figure A.5. The mnemonics for
the sequencer operations are given as follows, along with a description of how each operation affects
program control flow:

1. CJSR (Conditional Jump to SubRoutine). If the condition selected in field f1 is true, then push
(PC)+1 onto the PC stack, write the branch target address from field f2 into PC, push (IC) onto
the IC stack, and write the initial iteration count value from field f3 into IC. If the condition
selected in field f1 is false, then write (PC)+1 into PC.

2. LTST (Loop TeST). If the condition selected in field f1 is true, then write the top of the PC stack
into PC, pop the PC stack, write the top of the IC stack into IC, and pop the IC stack. If the
condition selected in field f1 is false, then decrement IC and write the branch target from field
f2 into PC.

3. CBR (Conditional BRanch). If the condition selected in field f1 is true, store the branch target
from field f2 into PC. If the condition selected in field f1 is false, store (PC)+1 into PC.

4. CONT (CONTinue). Store (PC)+1 into PC.

5. HALT. Terminate the computation.

If the system controller operation specified in the SC_Operation field is not among those listed
above, the sequencer operation is taken implicitly to be CONT.
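
The control-flow rules listed above can be exercised with a small model. The following Python sketch is a hedged illustration, not the simulator used in this work: the class and function names are hypothetical, programs are reduced to (operation, condition, branch target, iteration count) tuples, and the ICTO condition is assumed to mean IC = 0. The RSPO condition depends on PE state and is not modeled.

```python
class Sequencer:
    """Toy model of the sequencer state: PC, IC, and their stacks."""

    def __init__(self):
        self.pc = 0
        self.ic = 0
        self.pc_stack = []
        self.ic_stack = []
        self.halted = False

    def step(self, op, cond=False, target=0, count=0):
        if op == "CJSR":
            if cond:
                self.pc_stack.append(self.pc + 1)  # return address
                self.ic_stack.append(self.ic)      # save outer iteration count
                self.pc = target
                self.ic = count
            else:
                self.pc += 1
        elif op == "LTST":
            if cond:
                self.pc = self.pc_stack.pop()      # leave the loop body
                self.ic = self.ic_stack.pop()
            else:
                self.ic -= 1
                self.pc = target                   # repeat the loop body
        elif op == "CBR":
            self.pc = target if cond else self.pc + 1
        elif op == "HALT":
            self.halted = True
        else:
            self.pc += 1                           # CONT and unlisted operations


def run(program, max_steps=1000):
    """Return the sequence of program indices visited until HALT."""
    seq = Sequencer()
    visited = []
    while not seq.halted and len(visited) < max_steps:
        op, cond_name, target, count = program[seq.pc]
        visited.append(seq.pc)
        # FORC is always true; ICTO is assumed to test IC == 0.
        cond = {"FORC": True, "ICTO": seq.ic == 0}.get(cond_name, False)
        seq.step(op, cond, target, count)
    return visited


program = [
    ("CJSR", "FORC", 2, 2),  # 0: call the loop at index 2, iteration count 2
    ("HALT", None, 0, 0),    # 1: reached after the loop returns
    ("CONT", None, 0, 0),    # 2: loop body
    ("LTST", "ICTO", 2, 0),  # 3: loop test
]
print(run(program))  # [0, 2, 3, 2, 3, 2, 3, 1]
```

In this model the loop body at index 2 runs three times for an initial count of 2, because LTST tests IC before decrementing it.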

For sequencer operations, system controller instruction word field f1 controls the condition selector
ccmux. The mnemonic values for this field are as follows, along with their interpretations:

• RSPO. Select input 0, the RESPONSE = 0 condition.

• FORC. Select input 1, always true.

• ICTO. Select input 2, the IC = 0 condition.

Of the four fields in the system controller instruction word shown in Figure A.5, only the first
three control indexer operations. There are only three system controller index register subsystem
operations that can be selected in the SC_Operation field. The mnemonics for these operations are
given as follows, along with descriptions of how they work:

1. LDX (LoaD indeX register). Write the value contained in field f2 into the index register
addressed in field f1.

2.  SPX  (SteP  index  register).  Increment  the  index  register  addressed  in  field  f1  by  the  value  in 
field  f2. 

3.  CPX  (CoPy  index  register).  Copy  the  contents  of  the  index  register  addressed  in  field  f2  into 
the  index  register  addressed  in  field  f1. 

Any other value of the SC_Operation field leaves all index register file locations unaltered.
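
The three indexer operations and the formation of a broadcast literal can be sketched as follows. This Python fragment is illustrative only: the function names are hypothetical, and it assumes that the value produced by an indexer operation is the new content of the register it writes, which the text does not state explicitly. Only the null case (value 0) is taken directly from the description.

```python
def indexer_step(index_regs, op, f1=0, f2=0):
    """Apply one indexer operation to the index register file (a list).

    Returns the value produced for literal formation. Assumption: this is
    the new content of the written register; the null operation yields 0.
    """
    if op == "LDX":
        index_regs[f1] = f2               # load literal f2 into register f1
    elif op == "SPX":
        index_regs[f1] += f2              # step register f1 by f2
    elif op == "CPX":
        index_regs[f1] = index_regs[f2]   # copy register f2 into register f1
    else:
        return 0                          # registers are left unaltered
    return index_regs[f1]


def broadcast_literal(index_value, literal_base_value):
    """Literal sent to the PEs: indexer value plus the instruction's base value."""
    return index_value + literal_base_value


regs = [0, 0]
indexer_step(regs, "LDX", f1=0, f2=10)
value = indexer_step(regs, "SPX", f1=0, f2=4)
indexer_step(regs, "CPX", f1=1, f2=0)
print(regs, broadcast_literal(value, 100))  # [14, 14] 114
```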

A.4.2  PE  Machine  Code  Instruction 

The PE has a register-to-register instruction set with explicit operations used to access off-PE-chip
data via the MCSs. Each PE instruction specifies an operation, the register addresses of two sources
and of a result destination, a context operation, and a literal base value. The standard “null”
operation is NO_OP, the standard “null” register address is NO_LOC, and the standard “null” context
operation is NO_CO. In other words, a machine code instruction in which every field has the standard
“null” value is a null instruction that leaves the PE state unchanged.
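
One way to picture the “null” defaults is as a record whose unspecified fields fall back to the standard null values. The Python sketch below is an assumption-laden illustration: the class name and field spellings are hypothetical, and the null literal base value is assumed to be 0, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PEInstruction:
    """PE part of a machine code instruction with standard null defaults."""
    c: str = "NO_CO"             # context operation
    dest: str = "NO_LOC"         # destination register address
    operation: str = "NO_OP"     # FU or MCS operation
    src_a: str = "NO_LOC"        # first source register address
    src_b: str = "NO_LOC"        # second source register address
    literal_base_value: int = 0  # assumed null value for the literal

    def is_null(self):
        """True when every field holds its standard null value."""
        return self == PEInstruction()


add = PEInstruction(dest="REG_3", operation="ADD", src_a="REG_1", src_b="REG_2")
print(PEInstruction().is_null(), add.is_null())  # True False
```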

The  PE  machine  code  instruction  has  the  following  general  form: 

Dest = Operation(SrcA, SrcB)

The  sources  and  destination  of  a  PE  machine  code  instruction  are  PE  registers,  so  the  machine 
code  instruction  specifies  a  register-to-register  operation.  The  PE  machine  code  instruction  includes 
the  fields  specified  in  Figure  A.6. 

The set of PE operations is large, including the ordinary two-operand arithmetic and logical
operations as well as operations using the MCSs. Following standard uniprocessor design
practice, instruction execution is pipelined. The PE executes instructions in the three-stage pipeline
illustrated in Figure A.7.

In a pipelined PE, some instructions begun on successive PE clock cycles exhibit pipeline
hazards. A pipeline hazard arises, for example, when a second instruction is flow-dependent on the
immediately preceding instruction. (Such a pipeline hazard is called a destination-source pipeline
conflict [34].) Some pipeline hazards necessitate the second instruction being delayed, so as to allow
time for its operand to be produced by the first instruction. Failure to do so yields an incorrect result.
In the basis computer, the PE contains no pipeline interlocks, so machine code programs containing
pipeline hazards are not allowed.

Figure A.6: PE Machine Code Instruction Word Components

Figure A.7: PE Execution Pipeline
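
Because the PE has no interlocks, a machine code program has to be checked for hazards before it is run. The following Python sketch shows one simplified check for destination-source conflicts. It is an assumption-based illustration: instructions are reduced to (Dest, Operation, SrcA, SrcB) tuples, the hazard window is assumed to be exactly one instruction, and the function name and register spellings are hypothetical.

```python
NO_LOC = "NO_LOC"


def find_hazards(program):
    """Return indices of instructions that read the register written by
    the immediately preceding instruction (destination-source conflict)."""
    hazards = []
    for i in range(1, len(program)):
        prev_dest = program[i - 1][0]
        _, _, src_a, src_b = program[i]
        if prev_dest != NO_LOC and prev_dest in (src_a, src_b):
            hazards.append(i)
    return hazards


program = [
    ("REG_1", "ADD", "REG_2", "REG_3"),
    ("REG_4", "SUB", "REG_1", "REG_5"),  # reads REG_1 one instruction too early
    ("REG_6", "AND", "REG_2", "REG_3"),
]
print(find_hazards(program))  # [1]

# Separating the dependent pair with a null instruction removes the conflict.
fixed = [program[0], (NO_LOC, "NO_OP", NO_LOC, NO_LOC)] + program[1:]
print(find_hazards(fixed))    # []
```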

An  MCS  operation  which  requires  more  than  one  subsystem  clock  cycle  to  be  performed  may  be 
phase-split:  An  initiating  instruction  starts  the  operation,  and  a  terminating  instruction  completes 
the operation. The initiating instruction specifies the operands and the operation itself, while the
terminating  instruction  specifies  the  destination  into  which  to  write  the  result. 

Phase-splitting in machine code programs expresses the overlap of high-latency MCS operation
with operations on other MCSs or on the FU. For machine code programs describing generic SIMD
computations, the number of instructions intervening between the pair specifying a phase-split MCS
operation is exactly 2 less than the number of clock cycles required for the operation. Instructions
intervening between the pair may specify FU operations or operations on MCSs other than the one
in use by the pair. An operation specified in one of these intervening instructions overlaps with
the outstanding operation. Where overlap cannot occur, for example because no flow-independent
instruction is available, the time between the initiating and terminating instructions is spent waiting
for the long-duration MCS operation to complete.
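
The spacing rule for a phase-split pair can be illustrated with a small scheduling helper. This Python sketch is hypothetical: the function names are invented for the example, instructions are plain strings, and the LOAD_TX and LOAD_RX mnemonics are assumed here to stand for the initiating and terminating halves of a load.

```python
def intervening_slots(latency_cycles):
    """Instructions between the initiating and terminating instruction:
    exactly 2 less than the clock cycles the operation requires."""
    if latency_cycles < 2:
        raise ValueError("a phase-split operation needs at least 2 cycles")
    return latency_cycles - 2


def schedule_phase_split(initiate, terminate, independent, latency_cycles):
    """Fill the gap with flow-independent instructions, then with NO_OP."""
    gap = intervening_slots(latency_cycles)
    fill = list(independent[:gap])
    fill += ["NO_OP"] * (gap - len(fill))  # waiting time when nothing overlaps
    return [initiate] + fill + [terminate]


print(schedule_phase_split("LOAD_TX", "LOAD_RX", ["ADD"], latency_cycles=5))
# ['LOAD_TX', 'ADD', 'NO_OP', 'NO_OP', 'LOAD_RX']
```

The NO_OP entries correspond to the time spent waiting when no flow-independent instruction is available.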

Some  MCS  operations  return  no  value  to  the  PE.  These  operations  include  local  external  memory 
store,  system  data  memory  store,  and  response.  These  operations  are  phase-split  using  only  a  single 
initiating  instruction.  Although  no  instructions  after  the  initiating  instruction  are  needed  to  perform 
the  operation,  no  subsequent  instruction  may  specify  another  operation  on  the  busy  MCS  until  the 
current  operation  completes. 

The  only  operation  using  the  instruction  broadcast  subsystem  itself  is  the  LITERAL  operation  that 
delivers  an  indexer-calculated  value  to  the  PEs.  This  operation  takes  no  source  operands.  Because 
literals  are  delivered  via  the  global  instruction  broadcast  network,  they  are  always  instruction 
delivery-rate  limited.  Therefore  even  if  its  latency  is  high,  a  LITERAL  operation  cannot  be  phase-split. 

The  fields  of  the  PE  instruction  shown  in  Figure  A.6  are  interpreted  as  follows: 

• C (Context). This field specifies a context operation from the following set (creation of a new
context uses an operation specified in the Operation field):

- NO_CO No context operation.

- FRC Force modification of this instruction’s Dest irrespective of the current context.

- POP Revert to the previous context.

- INV Invert the sense of the current context.

- CLR Reset context to its power-up state.

•  Dest  (Destination).  This  field  specifies  the  PE  location  to  be  written  with  the  result  of  the 
instruction’s  operation  via  busC,  if  permitted  by  the  current  context.  Possible  destinations  are 
among  the  following: 

- NO_LOC. Nowhere.

- REG_i. Location i in the register file.

-  PTR  (Pointer).  The  register  file’s  POINTER  register. 

-  IND  (Indirect).  The  register  whose  address  is  contained  in  POINTER. 

• SrcA and SrcB (Source). These fields specify the locations providing operands via busA and
busB, respectively. Possible sources are as follows:

- NO_LOC. No operand is required on this bus for this instruction.

- REG_i. The operand is read from location i in the register file.

- PTR (Pointer). The operand is read directly from POINTER, the index register available
for addressing the register file.

- IND (Indirect). The operand is read from the register whose address is contained in
POINTER.

- LIT (Short literal). The operand is supplied as a literal contained in this instruction field.
(For implementation economy, only SrcA can be used for short literals.)

- FU_OUT The operand is supplied from the FU output register.

- LIT_OUT The operand is supplied from the LIT_OUT register, containing the most recently
received broadcast literal value.

- LEM_OUT The operand is supplied from the local external memory circuit output register.

- COM_OUT The operand is supplied from the inter-PE communication circuit output
register.

- IO_OUT The operand is supplied from the system data memory circuit output register.

• Lit_Value. This field carries a constant to be stored or operated upon within the PE.

• Operation. This field specifies activity in the FU or in one of the MCSs. This field designates
a unit to perform an operation and controls the latching of that unit’s input registers. There
are three sub-fields of the Operation field:

- Op_Code This sub-field names the operation to be performed, and by implication the unit
(FU or MCS) that performs it. When an operation completes, the result remains in the
unit’s output register until a subsequent operation is performed on that unit.

A list of possible operation codes follows, grouped according to the unit that performs the
operation:

* FU: PASS, NOT, AND, OR, NAND, NOR, XOR, ADD, SUB, MULT, DIV, MOD, LSHIFT, and RSHIFT.

* FU and context manager: LCPUSHLT, LCPUSHLE, LCPUSHEQ, LCPUSHNE, LCPUSHGE, and
LCPUSHGT. The FU subtracts the two operands, and the context manager generates a
new context on the basis of the condition codes set by the subtraction.

* Local external memory subsystem: LOAD, LOAD_TX, LOAD_RX, and STORE.

* Inter-PE communication subsystem:

• Linear array: LDN0, LDN0_TX, LDN0_RX, LUP0, LUP0_TX, and LUP0_RX,

• Square mesh: SDN0, SDN0_TX, SDN0_RX, SUP0, SUP0_TX, SUP0_RX, SDN1, SDN1_TX, SDN1_RX, SUP1,
SUP1_TX, and SUP1_RX,

• Cubic mesh: CDN0, CDN0_TX, CDN0_RX, CUP0, CUP0_TX, CUP0_RX, CDN1, CDN1_TX, CDN1_RX, CUP1,
CUP1_TX, CUP1_RX, CDN2, CDN2_TX, CDN2_RX, CUP2, CUP2_TX, and CUP2_RX,

• Router-based network: ROUTE, ROUTE_TX, and ROUTE_RX

* System data memory subsystem: IO_LD, IO_LD_TX, IO_LD_RX, IO_ST

* Response subsystem: RESPOND

* Literal: LITERAL

- Op_Cycle This field tracks the step index in a multi-step FU operation. Given a value S-1
on the first step of an S-step sequence, this field is decremented by one on each successive
instruction. The steps intervening between the first and the last are place holders for the
specific operations that would be performed on an actual PE.

- new_op This one-bit field indicates whether the input registers of the unit designated by
the Op_Code should latch the operands. This field is un-asserted for all but the first step of
a multi-cycle FU operation sequence.
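
The Op_Cycle and new_op conventions for a multi-step FU operation can be made concrete with a short generator. The Python sketch below is an illustration under assumptions: the function name is hypothetical, and each step is represented as a dictionary holding only the three sub-fields of the Operation field.

```python
def expand_fu_operation(op_code, steps):
    """Return the Operation sub-fields for each step of an S-step FU operation.

    Op_Cycle starts at S-1 and decreases by one per instruction; new_op is
    asserted only on the first step, when the input registers latch operands.
    """
    if steps < 1:
        raise ValueError("an operation needs at least one step")
    return [
        {"Op_Code": op_code, "Op_Cycle": steps - 1 - i, "new_op": i == 0}
        for i in range(steps)
    ]


for fields in expand_fu_operation("MULT", 4):
    print(fields)
# Op_Cycle runs 3, 2, 1, 0 and new_op is True only in the first row.
```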


A.5  PE  Chip  Local  Controller 

Adding  I-cache  to  a  SIMD  computer  really  means  changing  the  local  controller  design.  The  local 
controller  in  the  PE  chip  of  a  generic  SIMD  computer,  shown  in  Figure  A.8,  is  very  simple. 


Figure  A.8:  Local  Controller  for  Generic  SIMD  Computer 

The local controller of a generic SIMD computer standardizes the system clock and latches
broadcast instructions for use within the PE chip.

Adding I-cache to the PE chip involves a multi-clock generator and a cache controller. The
multi-clock generator provides all clocks needed in the PE chip, including the PE clock. The PE clock
rate exceeds the system clock rate, so there are multiple cycles of the PE clock per cycle of the system
clock. In addition to controlling cache memory, the cache controller provides a new instruction for
local broadcast within the PE chip on each cycle of the PE clock. A local controller with I-cache is
shown in Figure A.9.

Figure A.9: Local Controller with I-Cache

A  consequence  of  the  variety  of  time  bases  within  the  PE  chip  is  that  correspondingly  clocked 
control  words  need  to  be  provided  to  each  of  the  separately  clocked  subsystems.  Figure  A.9  shows  a 
separate  control  latch  ( Jmntrol)  for  each  subsystem.  Note  however  that  providing  the  PEs  and  each 
of the MCSs a unique clock is a change only in the routing of clock signals to those subsystems, rather
than  a  change  in  the  logic  of  the  clocked  elements  themselves. 

The  model  allows  that  each  MCS  may  have  its  own  clock  rate.  It  is  possible,  however,  that  subsets 
of  the  MCS  clocks  are  unified.  For  example,  where  the  I/O  subsystem  and  the  response  subsystem 
both  operate  at  the  global  instruction  broadcast  rate,  both  those  MCSs  are  regulated  by  the  system 
clock,  just  as  they  are  in  the  generic  SIMD  computer.  Another  possibility  is  that  local  external 
memory  and  inter-PE  communication  operate  at  the  same  rate  as  the  PE’s  intra-chip  components, 
in  which  case  these  MCSs  are  regulated  by  the  PE  clock.  The  presentation  here  assumes  the  most 
general case, where each subsystem’s clock rate is unique.

The PE clock, PE_CLK, is the fastest clock in the PE chip and in the computer. PE_CLK regulates the
components that are integrated entirely within the PE chip, including the local controller, the local
instruction broadcast network, and the PEs. This assertion about PE_CLK reflects an assumption
that the PE chip is an equipotential region (as defined in [69]) within which signaling delays are
negligible.

There are 5 clocks other than PE_CLK in the computer:

• SYS_CLK, the clock regulating the global instruction broadcast network and, for simplicity, the
system controller. PE_CLK runs pb times faster than SYS_CLK.

• LEM_CLK, the clock regulating the local external memory subsystem. PE_CLK runs pl times
faster than LEM_CLK.

• COM_CLK, the clock regulating the inter-PE communication subsystem. PE_CLK runs pc times
faster than COM_CLK.

• IO_CLK, the clock regulating the system data memory subsystem. PE_CLK runs pi times faster
than IO_CLK.

• RSP_CLK, the clock regulating the response subsystem. PE_CLK runs pr times faster than
RSP_CLK.

One way to generate high-rate PE clocks is to use phase-locked loop (PLL) techniques, such as
those described in [88]. PLLs for generating high-rate on-chip clocks are increasingly frequently
used in microprocessors due to the increasing disparity between typical intra-chip and inter-chip
signaling rates [5, 88].

The multi-clock generator used in the basis computer outputs a set of clocks that are free-running
and in-phase with SYS_CLK. The p value associated with each subsystem expresses the factor by
which PE_CLK is faster than that subsystem’s clock. Section 3.9 introduces a p-set that characterizes
a SIMD computer's relative rates of MCS operation. The design of the multi-clock generator imposes
the following constraints on the p-sets that characterize the SIMD computers for which I-cache is
evaluated:

1. Each p-set value is a positive integer ≥ 1, and

2.  pb  is  an  integer  multiple  of  every  p-set  value. 

These constraints mean that each MCS clock rate is an integer sub-multiple of the PE clock rate
and that each MCS clock rate is an integer multiple of the system clock rate. While simplifying
the systematic variation of the p-sets in evaluating I-cache variants, these constraints unfortunately
also quantize the space of possible p-sets. For example, if the PE clock rate is A times higher than
the top operation rate of the inter-PE communication subsystem (pc = A), the PE clock rate is B
times higher than the top operation rate of the system data memory I/O subsystem (pi = B), and A
and B happen to be mutually prime, then the PE clock rate must be some multiple of A*B times
faster than global instruction broadcast (pb = nAB for some n). In other words, the p values should
be independent variables, but they are made inter-dependent by the multi-clock generator design.
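
The two constraints and the quantization example can be checked mechanically. The following Python sketch is illustrative only: the function names and the dictionary keys ("b", "c", "i") are assumptions made here, not names taken from the multi-clock generator design, and it assumes that a p value of 1 is permitted.

```python
from functools import reduce
from math import gcd


def is_valid_p_set(p_set, broadcast_key="b"):
    """Check a p-set against the two multi-clock generator constraints.

    p_set maps a subsystem tag (hypothetical names) to its p value;
    broadcast_key selects the entry for global instruction broadcast (pb).
    """
    values = list(p_set.values())
    # Constraint 1: every value is a positive integer (1 assumed allowed).
    if not all(isinstance(p, int) and p >= 1 for p in values):
        return False
    # Constraint 2: pb is an integer multiple of every p-set value.
    p_b = p_set[broadcast_key]
    return all(p_b % p == 0 for p in values)


def smallest_p_b(mcs_values):
    """Smallest pb compatible with the given MCS p values (their LCM)."""
    return reduce(lambda a, b: a * b // gcd(a, b), mcs_values, 1)


# Quantization example from the text: pc = A and pi = B, with A and B coprime.
A, B = 3, 4
print(smallest_p_b([A, B]))                        # smallest admissible pb
print(is_valid_p_set({"b": 12, "c": A, "i": B}))   # pb is a multiple of both
print(is_valid_p_set({"b": 6, "c": A, "i": B}))    # pb is not a multiple of B
```

With coprime A and B, the least common multiple equals A*B, which is the n = 1 case of pb = nAB.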

A.6  Changed  Globally  Broadcast  Instruction  Format 

The multi-clock generator provides a time base inside the PE chip that is faster than the time base of
the global instruction broadcast subsystem that ultimately controls the PE chip. So that the globally
broadcast instructions may specify activity within the PE chip at the higher temporal resolution, a
new field is added to the broadcast instruction. This field, called delayed_instruction_delay_count (or
dldc for short), specifies the number of PE clock cycles for which the local controller is
to wait before applying the destination write-control information conveyed in a broadcast instruction.
The dldc field is necessary because it is possible, depending on operation stepcounts and the p-set,
for an MCS operation to conclude on an arbitrary cycle of the fast PE chip clock. The dldc field
essentially provides the index of this cycle, such that a result returned by an MCS operation is stored
in the PE’s register memory on the cycle it becomes available. Note that the addition of the dldc field
to the global broadcast instruction word does not affect the instruction broadcast locally within the
PE chip, nor does it affect the PE itself.

Note that the dldc field is needed because PE.CLK is faster than SYS.CLK, not because of I-cache
itself. However, the cache controller, which sequences instructions to the PEs, also interprets the
new dldc field appropriately.
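
One way to picture the role of the delay count is the small Python sketch below. It is an illustration under stated assumptions, not the scheduler's actual rule: it assumes that an MCS operation starts on PE.CLK index 0 of a system clock cycle and lasts steps * p_x PE clock cycles, and the function name is hypothetical.

```python
def completion_slot(steps, p_x, p_b):
    """Locate the PE clock cycle on which an MCS operation completes.

    Assumes (for illustration) that the operation starts on PE.CLK index 0
    of a system clock cycle and lasts steps * p_x PE clock cycles.
    Returns (system clock cycles elapsed, index of the completing PE clock
    cycle within that system clock cycle). The second value plays the role
    that the text assigns to the delay count.
    """
    if steps < 1 or p_x < 1 or p_b < 1:
        raise ValueError("steps, p_x and p_b must be positive")
    return divmod(steps * p_x, p_b)


# A 3-step operation on a subsystem with p_x = 2, with p_b = 4:
print(completion_slot(3, 2, 4))
```

When the operation length is a multiple of p_b the index is 0, so no extra delay is needed; any other length ends part-way through a system clock cycle.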

A.7 Two I-Cache Variants: F0 and F2

The F-family of single-port caches is introduced in Section 4.2. An F0 I-cache is the simplest family
member, able to store only a single cache block at a time, and able to execute only single iterations of
the stored cache block. F2 is a slightly more powerful I-cache variant, still able to store only a single
cache block at a time, but able to execute multiple iterations of the block without assistance from
globally broadcast instructions.

The F0 and F2 cache-control protocols are almost identical, each including the following four
cache-control instructions:

• CC_NOOP. No cache control operation.

• CC_BSTO (Begin STOring). The globally broadcast instruction following this one is the first in a
cache block about to be stored. That next broadcast instruction will be placed at address 0 in
cache memory, with subsequent broadcast instructions stored to subsequent locations in cache
memory.

• CC_ESTO (End STOring). The present globally broadcast instruction is the last in the cache block
currently being stored. The instruction specifying CC_ESTO serves as a sentinel delimiting the
end of a cache block, and is itself placed in the cache.

• CC_FORK. Activate the previously stored cache block.

An F2 I-cache is identical to an F0 I-cache, except that an F2 I-cache’s CC_FORK instruction also
specifies the number of times that the routine is to be iterated. The cache-control instruction
associated with any globally broadcast instruction not in the above list is interpreted as CC_NOOP.

The cache controller generates the control signals necessary for accessing the cache memory and
selects locally broadcast instructions through Imux. On each PE.CLK cycle, the cache controller is in
one of six states. The intended meanings of these cache control states are described below:

• LOCK. No cache block is active, and globally broadcast instructions are being executed. The
cache controller supplies pb - 1 “null” instructions after every globally broadcast instruction
received while in the LOCK state. In this state, instructions are delivered to the PEs at the
same rate as in a generic SIMD computer, although the presence of fast subsystem clocks allows
some subsystems to run faster than in the generic SIMD computer.

• BSTO. The next broadcast instruction will be the first of the cache block to be stored in cache
memory.

• STOR. Globally broadcast instructions are being stored in consecutive cache memory locations.
The cache controller supplies pb “null” instructions during every SYS.CLK cycle, one on each PE
clock cycle, while in the STOR state. No useful instructions are executed by the PEs in this state.

• ESTO. The sentinel CC_ESTO instruction has been received, indicating that the entire cache block
has been stored in cache memory.

• EXEC. A cache block is active. The cache controller supplies an instruction from cache memory
on every cycle of PE.CLK in this state. Up to pb instructions are executed by the PEs during
every cycle of SYS.CLK while in the EXEC state.

• JOIN. Execution of a cache block has completed, but the subsequent global broadcast instruction
has not yet arrived at the PE chip. The cache controller supplies “null” instructions while in
the JOIN state. Cycles spent in the JOIN state are those wasted due to quantization; at most
pb - 1 cycles are spent in the JOIN state per cache block activation.
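
The bound on JOIN cycles follows from simple modular arithmetic. The Python sketch below is a hedged illustration: it assumes that cache block execution begins on PE.CLK index 0 and that the next globally broadcast instruction can only be accepted at the following index-0 boundary. The function name is hypothetical.

```python
def join_cycles(block_pe_cycles, p_b):
    """PE clock cycles spent in JOIN after a cache block finishes.

    Assumes execution starts at PE.CLK index 0 and that the next broadcast
    instruction is accepted only at the next index-0 boundary of SYS.CLK.
    """
    if block_pe_cycles < 0 or p_b < 1:
        raise ValueError("block length must be >= 0 and p_b must be >= 1")
    # Distance from the end of the block to the next multiple of p_b.
    return (-block_pe_cycles) % p_b


# With p_b = 4, a block of 5 PE clock cycles wastes 3 cycles in JOIN,
# while a block of 8 PE clock cycles wastes none.
print(join_cycles(5, 4), join_cycles(8, 4))
```

Because the result is a remainder modulo p_b, it can never exceed p_b - 1, which matches the limit stated for the JOIN state.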

Figure A.10: F0 State Transition Diagram. F2 state transitions are slightly more complicated,
because the iteration counter value is used in determining completion of a cache block’s execution.


The  power-up  state  is  LOCK. 

The diagram in Figure A.10 shows the allowed state transitions of the F0 cache controller.

The  labels  on  the  arcs  in  Figure  A.10  refer  to  the  following  values: 

• Op is the cache operation contained in the Op.Code subfield of the current instruction’s Operation field.

• index is the current phase of PE.CLK with respect to SYS.CLK, which is the value in the
PE.CLK_Index register.

•  DONE  is  a  Boolean  value  indicating  completion  of  the  execution  of  a  cached  instruction  sequence. 

Figure  A.10  shows  that  cache  controller  state  changes  usually  occur  when  index  ==  0. 

Timing  delays  are  represented  as  sequences  of  “null”  instructions.  These  sequences  can  be 
expensive  both  in  terms  of  time  to  place  them  in  cache  as  well  as  in  terms  of  the  space  they  occupy 
in  cache.  Sequences  of  “null”  instructions  are  compactly  encoded  using  a  single  instruction  that 
causes the cache controller’s program counter (CPC) to stall for a number of cycles equal to the
number  of  encoded  “null”  instructions.  In  principle,  these  sequences  representing  timing  delays  are 
unnecessary,  although  some  equivalent  means  of  representing  delays  is  needed  in  any  case. 
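
The compact encoding of delays amounts to run-length encoding of “null” instructions. The Python sketch below illustrates the idea only: the ("stall", n) tuple and the function names are hypothetical stand-ins for the single stall instruction, whose actual machine-code format is not shown here.

```python
NULL = "null"  # stand-in for the "null" instruction


def compact_nulls(instructions):
    """Replace each run of NULLs with one ("stall", run_length) entry."""
    compacted = []
    run = 0
    for ins in instructions:
        if ins == NULL:
            run += 1
            continue
        if run:
            compacted.append(("stall", run))
            run = 0
        compacted.append(ins)
    if run:
        compacted.append(("stall", run))
    return compacted


def expand_stalls(compacted):
    """Inverse of compact_nulls: what the PEs see while the CPC stalls."""
    expanded = []
    for ins in compacted:
        if isinstance(ins, tuple) and ins[0] == "stall":
            expanded.extend([NULL] * ins[1])
        else:
            expanded.append(ins)
    return expanded


block = ["a", NULL, NULL, NULL, "b", NULL]
packed = compact_nulls(block)
print(len(block), len(packed))          # cache locations before and after
print(expand_stalls(packed) == block)   # the timing seen by the PEs is kept
```

The encoded block occupies fewer cache locations and takes fewer broadcast instructions to store, while the expanded instruction stream is unchanged.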


Figure A.11: F0 and F2 Cache Controllers

Figure A.11 shows the F0 and F2 cache controllers. Note the simplicity of the
cache controllers, and the very slight change needed to convert an F0 I-cache variant into an F2
I-cache variant.


Appendix B

Assembly  Language  Programming  and 
Translation 


As  indicated  in  Figure  5.1  illustrating  the  method  used  to  evaluate  I-cache,  the  computations  solving 
the  sample  problems  were  described  using  assembly  language  programs.  The  assembly  language  is 
closely  related  to  the  machine  code  described  in  Appendix  A.  The  following  important  abstractions 
achieved  in  the  assembly  language  facilitated  the  programming  of  the  sample  problem  solutions: 

1.  The  assembly  language  programming  model  is  sequential,  abstracting  details  of  pipelined 
instruction  execution. 

2.  The  assembly  language  programming  model  abstracts  the  details  of  instruction  timing.  An 
assembly  language  program  specifies  a  sequence  of  operations,  without  regard  to  the  latencies 
of  individual  operations. 

3.  Assembly  language  programs  define  parameters  which  are  used  in  expressions  that  provide 
compile-time  literal  values. 

4.  Assembly  language  programs  define  labels  which  are  used  as  symbolic  branch  target  addresses 
in  system  controller  sequencer  instructions. 

These  abstractions  provide  convenience  in  describing  SIMD  computations.  For  example,  assembly 
language  program  parameters  allow  a  single  program  to  describe  a  set  of  operationally  structured 
SIMD  computations.  Typical  parameters  are  functions  of  problem  size  (N),  the  number  of  PEs  in  the 
system (P), or the number of PEs per PE module (K).

This appendix motivates the choice of assembly language programming, describes the language
itself, explains details relating to re-programming for I-cache, and presents some of the interesting
details of the language implementation.

B.1 Assembly Language v. High-level Languages

High-level languages used to describe data-parallel algorithms typically suppress such details as the
target system’s inter-PE communication network topology or the number of PE registers [39, 41, 84].
However, the details do affect the operational structure of a computation. For example, the inter-PE
communication network topology determines the set of available inter-PE communication subsystem
operations. As another example, maintaining a high-level language program variable in local external
memory requires local external memory access operations that would not be needed were the
variable maintained in a register. It is the properties of the operational structure of a computation
that determine I-cache speedup. In describing a sequence of FU and MCS operations performed
by the PE, an assembly language program makes the details of a computation’s operational structure
explicit. This characteristic distinguishes assembly language from the high-level languages.

The assembly language is not notable for the conciseness of algorithmic expression that it engenders.
The emphasis is on explicitness of operation sequences corresponding to physical activity
in a SIMD computation. There are no virtual processors and no virtual memory; there are only
representations of real PEs, real networks, and real memory required for detailed I-cache speedup
measurements.

Of  course,  it  is  possible  to  compile  into  assembly  language  from  a  high-level  language.  It  seemed  at 
the  outset  of  this  work  that  the  number  of  examples  would  be  small  enough,  and  the  details  important 
enough,  to  merit  writing  assembly  language  programs  by  hand.  In  fact,  coding  efforts  often  began 
with  a  representation  of  the  PE  operation  sequence  in  a  C-like  pseudo-code,  subsequently  hand- 
translated  into  assembly  language.  While  assembly  language  facilitated  producing  the  ultimate 
machine code programs needed for the simulations, its small number of simple abstractions were
easier to implement efficiently than the high-level programming abstractions that are themselves
the  subject  of  many  a  dissertation  would  have  been. 


B.2  Assembly  Language  Syntax 

This  section  provides  an  overview  of  the  assembly  language.  Appendix  C  shows  the  derivation  of 
an  example,  and  the  assembly  language  program  in  Figure  C.3  may  be  a  helpful  illustration  of  the 
syntax  described  here. 

The  first  line  of  an  assembly  language  program  declares  the  program’s  parameter  names.  Each 
remaining  line  specifies  a  label  or  one  statement.  Each  statement  associates  a  system  controller 
instruction  with  a  PE  instruction,  although  either  part  of  a  statement  may  be  left  blank.  A  “!” 
character  denotes  the  beginning  of  a  comment  that  runs  to  the  end  of  the  current  line. 

Each system controller operations associated with register-to-register PE operations. The assembly
language syntax for baseline computations is simple: each line contains a statement or a label.
A  statement  has  a  system  controller  part  and/or  a  PE  part,  separated  by  a  semicolon,  as  follows: 

system_controller_part ; PE_part

B.2.1  System  Controller  Instruction 

The  system  controller  part  of  a  line  specifies  a  system  controller  instruction  along  with  parameters 
for  that  instruction.  The  format  of  the  system  controller  instruction  closely  follows  that  of  the 
corresponding  machine  code  instruction  shown  in  Figure  A.5,  with  the  exception  that  conditions  are 
specified  mnemonically,  branch  targets  are  labels  instead  of  absolute  addresses,  and  iteration  counts 
are  specified  as  expressions  contained  within  matched  single-quote  characters. 

B.2.2  PE  Instruction 

The  PE  part  of  a  statement  specifies  a  PE  instruction  of  the  form: 

[Context] Dest = Operation (SrcA, SrcB)

The Context field specifies a context operation; Dest, SrcA, and SrcB specify register addresses;
and Operation names an FU or MCS operation. The operations are listed in Section A.4.2. Any
field omitted in the PE part of a line takes its standard “null” value.
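
The statement syntax described above is simple enough to sketch a parser for it. The Python fragment below is a hypothetical illustration, not the assembler used in this work: the mnemonics in the example (LOOP, IF, ADD) and all function and field names are assumptions, labels are not handled, and the sketch ignores the possibility of a “;” or “!” appearing inside a quoted iteration-count expression.

```python
import re

# [Context] Dest = Operation (SrcA, SrcB), with every field optional
# except Operation (an assumption made for this sketch).
PE_PART = re.compile(
    r"^(?:\[(?P<context>[^\]]*)\]\s*)?"
    r"(?:(?P<dest>\w+)\s*=\s*)?"
    r"(?P<op>\w+)\s*"
    r"(?:\(\s*(?P<src_a>\w+)\s*(?:,\s*(?P<src_b>\w+)\s*)?\))?$"
)


def parse_statement(line):
    """Split one statement into its system controller part and PE part.

    Returns None for blank or comment-only lines. Omitted PE fields are
    reported as None, standing in for their "null" values.
    """
    code = line.split("!", 1)[0].strip()   # "!" starts a comment
    if not code:
        return None
    sc_part, _, pe_part = code.partition(";")
    sc_part, pe_part = sc_part.strip(), pe_part.strip()
    pe = None
    if pe_part:
        match = PE_PART.match(pe_part)
        if match is None:
            raise ValueError(f"cannot parse PE part: {pe_part!r}")
        pe = match.groupdict()
    return {"sc": sc_part or None, "pe": pe}


print(parse_statement("LOOP 'N/P' ; [IF] r3 = ADD (r1, r2)  ! running sum"))
print(parse_statement("! comment only"))
```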

B.3 Assembly Language Re-Programming for I-Cache

I-cache  requires  that  a  small  number  of  cache-control  instructions  be  added  to  the  set  of  globally 
broadcastable  machine  code  instructions.  Cache-control  instructions  direct  the  storing  of  cache 
blocks  and  their  subsequent  retrieval  from  cache. 

I-cache uses two additions to the assembly language. One addition is the inclusion of cache-control
instructions among the set specified in the PE part of an assembly language statement. The
three cache-control instructions for F0 and F2 I-cache variants are CC_BSTO and CC_ESTO, delimiting
cache block preambles, and CC_FORK, activating a previously stored cache block.

Another addition to the assembly language is a special instruction form called a “FORK construct”.
The FORK construct allows the programmer to associate a CC_FORK operation with the cache preamble
that will have been executed prior to the CC_FORK itself. This construct frees the programmer from
having to determine the duration of cache block execution. The assembler/scheduler transforms each
FORK construct into a CC_FORK operation in the resulting machine code program and causes the system
controller to execute a wait loop for the duration of the cache block’s execution.

Figure C.5 shows the program from Figure C.3 adapted to use an F0 I-cache variant. Figure C.6
shows the program adapted to use an F2 I-cache variant. The difference between the two programs
is that the F2 program associates an iteration count with the FORK construct, which is used in the
CC_FORK instruction activating the F2 cache block.


B.4  Scheduler 

B.4.1  Basic  Block  Definition 

To  simplify  its  implementation,  the  scheduler  performs  code-motion  optimizations  only  within  basic 
blocks.  In  ordinary  programming  languages,  basic  blocks  are  defined  as  sequences  of  necessarily 
sequentially  executed  instructions,  or  “straight-line”  code;  only  the  first  statement  in  a  basic  block 
can be a branch target in the program’s execution, while only the last statement of the basic block
can  result  in  a  branch  being  taken  [29]. 

Compiler  optimizations  that  overlap  operations  along  multiple  independent  pathways  are  easiest 
to  perform  within  basic  blocks  [49].  Flow  graph  analysis  complexity  increases  exponentially  with  the 
number  of  possible  outstanding  branch  operations  one  is  willing  to  consider  simultaneously.  While 
it  is  certainly  possible  to  optimize  across  basic  blocks  in  some  cases  [29],  general  solutions  can  be 
prohibitively  difficult. 

The scheduler’s definition of a basic block is augmented from the conventional definition: not
only do labels and conditional branches delimit basic block boundaries, but PE context management
instructions delimit basic block boundaries as well.
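
The augmented definition can be sketched as a single pass over the statement list. The Python fragment below is a minimal illustration under assumptions: statements are modeled as dictionaries with hypothetical "label", "branch", and "context" flags, and a context management instruction is assumed to close the block it appears in (the text says only that such instructions delimit block boundaries).

```python
def split_basic_blocks(statements):
    """Partition statements into basic blocks.

    A label starts a new block; a branch or (per the augmented definition)
    a PE context management instruction ends the current block. Treating
    the context instruction as the last statement of its block is an
    assumption of this sketch.
    """
    blocks, current = [], []
    for st in statements:
        if st.get("label") and current:
            blocks.append(current)      # a label can only begin a block
            current = []
        current.append(st)
        if st.get("branch") or st.get("context"):
            blocks.append(current)      # nothing may follow in this block
            current = []
    if current:
        blocks.append(current)
    return blocks


program = [
    {"op": "a"},
    {"op": "b", "context": True},
    {"op": "c"},
    {"op": "d", "label": "L1"},
    {"op": "e", "branch": True},
    {"op": "f"},
]
print([[st["op"] for st in block] for block in split_basic_blocks(program)])
```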

B.4.2  Pipeline  Optimization 

A  PE  instruction  is  at  risk  of  a  pipeline  hazard  if  either  of  its  operands  is  a  PE  register.  A  pipeline 
hazard arises when one of an instruction’s source registers is written by a dynamically preceding
instruction.  In  pipelined  execution,  the  preceding  instruction  will  not  have  written  the  register  by 
the  time  it  is  read  for  this  instruction.  A  straightforward  transcription  of  the  assembly  language 
program  would  thus  yield  incorrect  computation  results. 

A  conservative  solution  to  a  pipeline  hazard  is  to  delay  the  second  instruction,  stalling  the  pipeline 
by  inserting  a  NOOP  into  the  machine  code  program.  This  delay  ensures  that  the  second  instruction 
obtains  the  correct  register  value.  However,  in  some  cases  pipeline  hazards  can  be  repaired  so  that 
no  cycles  are  wasted  on  pipeline  stalls. 

The scheduler detects pipeline hazards by examining each instruction’s sources and comparing
them  against  each  potentially  preceding  instruction’s  destination.  If  either  of  this  instruction’s 
source  registers  is  the  same  as  any  of  the  preceding  instructions’  destination  registers,  a  pipeline 
hazard  exists.  If  all  preceding  instructions’  PE  operations  write  the  same  register  and  all  of  those 
operations  use  the  FU  or  the  same  MCS,  then  a  pipeline  stall  is  avoided  by  replacing  this  instruction’s 
source  reference  with  a  reference  to  the  output  register  of  the  subsystem  used  by  those  preceding 
instructions.  Otherwise,  a  pipeline  stall  needs  to  be  inserted  so  that  this  instruction  can  execute 
correctly. 
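
The hazard test and its two possible repairs can be summarized in a few lines. The Python sketch below is a hedged illustration of the rule as described, not the scheduler's code: the record fields ("dest", "unit") and the output-register naming ("<unit>_OUT") are hypothetical.

```python
def resolve_source(src, predecessors):
    """Decide how one source register reference must be handled.

    predecessors: the instructions that may immediately precede this one,
    each a dict with "dest" (register written) and "unit" ("FU" or the
    name of an MCS). Returns ("ok", src) when there is no hazard,
    ("forward", output_register) when the reference can be redirected to
    the subsystem's output register, or ("stall", src) otherwise.
    """
    if all(p["dest"] != src for p in predecessors):
        return ("ok", src)
    all_write_src = all(p["dest"] == src for p in predecessors)
    units = {p["unit"] for p in predecessors}
    if all_write_src and len(units) == 1:
        # Every predecessor writes the same register through the same
        # subsystem, so its output register holds the needed value.
        return ("forward", units.pop() + "_OUT")  # hypothetical name
    return ("stall", src)


print(resolve_source("r1", [{"dest": "r2", "unit": "FU"}]))
print(resolve_source("r1", [{"dest": "r1", "unit": "FU"}]))
print(resolve_source("r1", [{"dest": "r1", "unit": "FU"},
                            {"dest": "r1", "unit": "LEM"}]))
```

Only the last case costs a cycle: the two possible predecessors deliver the value through different subsystems, so no single output register can stand in for r1.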

B.4.3  Phase-Splitting 

Phase-splitting is a widely used technique for overlapping PE operations along high-latency PE-external
pathways with PE-internal operations. The basic idea of phase-splitting is to re-cast the
high-latency  operation  as  a  pair  of  low-latency  operations,  one  initiating  and  one  terminating  activity 
on  the  high-latency  pathway.  The  initiating  and  terminating  operations  must  be  separated  by  a  fixed 
number  of  instructions  in  the  program  executed  by  the  PE.  Phase-splitting  makes  it  possible  for 
operations  on  other  data  pathways  to  overlap  with  the  high-latency  operation.  Where  such  overlap 
cannot  occur,  for  example  due  to  flow-dependencies  in  a  program,  time  is  spent  idle  waiting  for  the 
result  from  the  high-latency  operation. 

A  detailed  discussion  of  this  concept  in  the  context  of  multiprocessor  PE  memory  reads  through 
high-latency networks is found in [54]: Where possible, reads are initiated well in advance of the
instruction  requiring  the  referenced  value. 

The introduction of I-cache and its concomitant fast PE clock into the PE chip means that some
MCS operations which have single-cycle latency in generic SIMD computers (such as neighbor
communications in SLAP [27] and in MPP [4]) become high-latency operations in I-cached SIMD
computers. High-latency MCS operations should be overlapped where possible with each other and with
FU calculation. For example, a regular neighbor communication instruction is phase-split into an
initiating instruction transmitting a register value through the inter-PE communication subsystem
and a terminating instruction storing into a register a corresponding value received from that
subsystem. In SLAP, local external memory references always have latency greater than 1 and thus are
phase-split [39].

PE FU instructions for the basis computer cannot be phase-split because they require control
information on every cycle. In machine code programs, this assumption is reflected by the presence
of place holder instructions for all but the first and last instructions of a sequence; the initiating and
terminating operations of a sequence carry meaningful source and destination information, while
the intervening instructions carry out the steps of the FU operation.

Phase-split  MCS  instructions  are  not  allowed  in  assembly  language  programs.  The  scheduler 
phase-splits  high-latency  MCS  instructions,  and  it  subsequently  reorganizes  basic  blocks  to  overlap 
those  high-latency  instructions  with  other  instructions  where  possible.  An  example  of  phase-splitting 
is shown in the left-hand side of Figure B.1.

When an assembly language statement’s PE instruction is phase-split and the statement also
specifies a system controller sequencer instruction, the sequencer instruction is associated with
the terminating node of the phase-split instruction. If the statement specifies a system controller
indexer instruction, the indexer instruction is associated with the initiating node of the phase-split
instruction.
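
Phase-splitting itself is a mechanical rewrite. The Python sketch below illustrates it under stated assumptions: instructions are modeled as dictionaries with hypothetical fields, the phases are named with "_TX" and "_RX" suffixes, and an operation of latency L is assumed to need L - 1 intervening instruction slots, filled here with placeholder "NULL" entries that a reorganizer could later replace with useful work.

```python
def phase_split(ins, latency):
    """Re-cast a high-latency instruction as an initiating/terminating pair.

    ins has "op", "srcs" and "dest" fields (hypothetical layout). The
    number of delay slots, latency - 1, is an assumption of this sketch.
    """
    if latency <= 1:
        return [dict(ins)]              # single-cycle: nothing to split
    tx = {"op": ins["op"] + "_TX", "srcs": tuple(ins["srcs"]), "dest": None}
    rx = {"op": ins["op"] + "_RX", "srcs": (), "dest": ins["dest"]}
    delays = [{"op": "NULL", "srcs": (), "dest": None}
              for _ in range(latency - 1)]
    # The fixed separation between the phases is kept by the delay slots.
    return [tx, *delays, rx]


load = {"op": "LOAD", "srcs": ("r1",), "dest": "r2"}
print([node["op"] for node in phase_split(load, 3)])
```

The initiating node carries only the sources and the terminating node carries only the destination, mirroring the description of transmitting a value and later storing the returned result.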

B.4.4  Code  Reorganizer 

The reorganizer attempts to overlap phase-split MCS instructions with each other and with FU
instructions. The reorganizer operates on the flow graph within basic blocks, performing conservative
greedy local re-ordering.

Figure B.1: Phase-Splitting and Operation Overlap

An example of the reorganizer’s result is shown on the right-hand side of Figure B.1.
In the example, the LIT and ADD instructions overlap with the first LOAD instruction, while no overlap
is possible for the second LOAD instruction.

Note  that  if  the  reorganizer  were  excluded,  the  scheduled  code  would  be  correct  but  possibly 
inefficient.  The  flow  graph  input  to  the  reorganizer  represents  as  phase-split  all  high-latency  MCS 
instructions. The reorganizer’s goal is to move the first phase of a phase-split operation up in the
flow graph (and thus earlier in the schedule) and to move the second phase of a phase-split operation
down in the flow graph (and thus later in the schedule). The instructions associated with the flow
graph nodes that intervene between the two nodes of the phase-split instruction overlap with that phase-split
operation. 

A constraint on the reordering of flow graph nodes is that results from MCS instructions must be
retrieved from the PE’s MCS interface registers on the cycle in which they become available. In other
words, the executions of the initiating instruction and the terminating instruction of a phase-split
pair must be separated by a fixed amount of time. This constraint is met by the flow graph input to
the reorganizer, because no nodes intervene between the two nodes of a phase-split instruction. The scheduler
inserts  explicit  delays  required  when  the  instruction  slots  between  the  two  phases  of  a  phase-split 
instruction  are  not  used  for  other  instructions. 

The initiating phase of a phase-split instruction has the same name as the original instruction
with the suffix “_TX”, while the terminating phase has the suffix “_RX”. The reorganizer traverses a
basic block until it finds a _TX instruction. When such an operation is found, the _TX phase instruction
is swapped up as far as possible, and then the associated _RX phase instruction is swapped down as
far as possible.

It is possible to swap the _TX phase statement (called T) with its preceding statement in the flow
graph (called C) when the following conditions obtain (where T’s _RX statement counterpart is called
R):

1. C’s PE instruction uses a different MCS from that used by T’s PE instruction, or C’s PE
instruction uses the FU;

2. Neither source of T’s PE instruction is the destination of C’s PE instruction;

3.  T  and  C  do  not  specify  system  controller  indexer  instructions  that  exhibit  index  register  flow 
dependencies; 

4. There is enough time available between T and R to accommodate the duration of C’s PE
instruction; 

5. If C is an _RX node, there is enough time available between C and its predecessor to accommodate
the  duration  of  the  phase-split  instruction  represented  by  T  and  R. 

It is possible to swap the _RX phase statement (R) with its succeeding statement in the flow graph
(C) when the following conditions obtain (where R’s _TX statement counterpart is T):

1.  C’s  PE  instruction  uses  a  different  MCS  from  that  used  by  R’s  PE  instruction,  or  C’s  PE 
instruction  uses  the  FU; 

2.  Neither  source  of  C’s  PE  instruction  is  the  destination  of  R’s  PE  instruction; 

3.  R  and  C  do  not  specify  system  controller  indexer  instructions  that  exhibit  index  register  flow 
dependencies; 

4. There is enough time available between T and R to accommodate the duration of C’s PE
instruction; 

5. If C is a _TX node that has an _RX node associated with it, there is an instruction slot available
between  C  and  its  successor  to  accommodate  R. 
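
The first list of conditions (swapping the initiating phase T upward past its predecessor C) can be written as a predicate. The Python sketch below is only an illustration: the record fields, the modeling of index-register flow dependencies as write/read set intersections, and the two timing arguments supplied by the caller are all assumptions, since the timing bookkeeping of the actual reorganizer is not reproduced here.

```python
def can_swap_tx_up(t, c, free_slots_t_to_r, c_rx_has_room=True):
    """Return True if T (a TX node) may be swapped above its predecessor C.

    t, c: dicts with "unit", "srcs", "dest", "duration", "phase",
    "idx_reads" and "idx_writes" (hypothetical layout).
    free_slots_t_to_r: unused time between T and its RX counterpart R.
    c_rx_has_room: result of the caller's check for condition 5.
    """
    # 1. C uses a different MCS from T, or C uses the FU.
    cond1 = c["unit"] == "FU" or c["unit"] != t["unit"]
    # 2. Neither source of T is the destination of C.
    cond2 = c["dest"] is None or c["dest"] not in t["srcs"]
    # 3. No index register flow dependency between the indexer instructions
    #    (modeled here as a write by one that the other reads).
    cond3 = not (set(c["idx_writes"]) & set(t["idx_reads"])
                 or set(t["idx_writes"]) & set(c["idx_reads"]))
    # 4. Enough time between T and R to hold C's PE instruction.
    cond4 = c["duration"] <= free_slots_t_to_r
    # 5. If C is an RX node, its own phase-split pair must have room.
    cond5 = c["phase"] != "RX" or c_rx_has_room
    return cond1 and cond2 and cond3 and cond4 and cond5


t = {"unit": "LEM", "srcs": ("r1",), "dest": None, "duration": 1,
     "phase": "TX", "idx_reads": (), "idx_writes": ()}
c = {"unit": "FU", "srcs": ("r2", "r3"), "dest": "r4", "duration": 1,
     "phase": None, "idx_reads": (), "idx_writes": ()}
print(can_swap_tx_up(t, c, free_slots_t_to_r=2))
```

A predicate for swapping the terminating phase downward would have the same shape, with the roles of source and destination in condition 2 exchanged as in the second list.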

B.4.5 Calculating I-cache Speedups

The  determination  of  whether  to  use  a  cachable  instruction  sequence  as  a  cache  block  rests  on  the 
balance  between  the  extra  time  taken  to  store  the  cache  block  and  the  time  saved  by  executing  the 
instruction sequence from cache. If the former exceeds the latter, then the sequence should not be
cached.  If  the  latter  exceeds  the  former,  then  the  sequence  might  be  cached,  so  long  as  so  doing  does 
not  disadvantage  other,  more  profitably  cached  sequences. 
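
The balance described above reduces to a single comparison for an isolated sequence. The Python sketch below is an illustration with hypothetical names; all times are in the same unit (for example, system clock cycles), and competition with other candidate sequences for cache space is deliberately not modeled.

```python
def caching_gain(store_overhead, uncached_time, cached_time, activations):
    """Net time saved by caching one instruction sequence.

    store_overhead: extra time taken to store the cache block.
    uncached_time / cached_time: time for one execution of the sequence
    from global broadcast and from cache, respectively.
    activations: number of times the stored block is executed.
    """
    return activations * (uncached_time - cached_time) - store_overhead


def worth_caching(store_overhead, uncached_time, cached_time, activations):
    """True when the time saved exceeds the time spent storing the block."""
    return caching_gain(store_overhead, uncached_time,
                        cached_time, activations) > 0


# Storing costs 10 cycles; each activation saves 8 - 4 = 4 cycles.
print(worth_caching(10, 8, 4, 2), worth_caching(10, 8, 4, 3))
```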

Cachable  instruction  sequences  whose  executions  alternate  during  the  computation  compete  for 
cache space. In single-block I-cache variants, including F0 and F2, caching both of a pair of such
sequences  necessitates  re-storing  them  as  they  are  needed.  Therefore  the  determination  in  general 
of  which  instruction  sequences  to  store  can  be  arbitrarily  complicated,  depending  as  it  does  on 
the  degree  of  conflict  among  candidate  sequences.  Such  conflicts  arise  also  for  multi-block  I-cache 
variants  due  to  the  limited  capacity  of  cache  memory. 

In  the  basis  computer,  each  MCS  has  its  own  local  control  within  the  PE  chip.  Instructions  are 
needed  only  to  initiate  MCS  operations  and  to  terminate  them  by  directing  the  storing  of  returned 
results in PE registers. The FU is different from the MCSs, in that it requires a new instruction on
every clock cycle of its operation; the operation of the FU is instruction delivery rate-bound. Some
actual SIMD computers diverge from this model. An example of such divergence is found in the
SLAP local controller, which manages the steps of a multiplication autonomously after receiving an
initiating instruction. Another example of divergence from the model is found in the CM-1 inter-PE
communication subsystem, which requires broadcast control on each clock cycle of its operation [42].

The  following  are  some  general  observations  regarding  cache  speedups  for  the  basis  computer. 
These observations are used heuristically in the scheduling algorithm:

1. A cachable sequence of FU instructions that does not overlap with any MCS operation always
executes pb times faster from cache. The calculation-intensiveness of the problem, the circuit
complexity  of  the  PE  relative  to  the  widths  of  problem  data  words  (in  bits),  and  to  a  limited 
extent  the  number  of  PE  registers  determine  the  lengths  of  such  sequences. 

2.  A  cachable  sequence  of  instructions  using  a  single  MCS,  where  no  other  instruction  is  available 
to overlap with those in the sequence, may or may not execute faster from cache.

The I-cache speedup is subject to quantization and so depends on whether the durations of the
instructions in the sequence happen to be multiples of pb. When an instruction’s latency is not
a multiple of pb, then in the generic SIMD computation there is some slack time between the
completion of the instruction and the arrival of the next broadcast instruction. Delivering such
an instruction sequence from cache allows instructions to be provided at the MCS’s operation
rate. The greater temporal resolution of MCS control within the PE chip gained by I-cache
saves that otherwise wasted slack time.

This time savings is significant in some instances. For example, consider a sequence of N
inter-PE communication instructions on a linear array, such as might arise in the inner loop of an
image motion-compensation program. Assume that the linear array communication instruction
takes a single clock cycle. The sequence requires N globally broadcast instructions, and N
system clock cycles, to complete in the generic SIMD computer. If shifting on a linear array
could occur at twice the rate of global instruction broadcast, then the sequence can be executed
from I-cache in just $\lceil N/2 \rceil$ system clock cycles, resulting in I-cache speedup of
approximately 2.

In general, a sequence of single-cycle instructions on MCS X executes faster from I-cache by a
factor of $p_b / p_X$. A sequence of s-cycle operations on MCS X executes faster from I-cache by
the following factor:

$$\frac{\lceil s \cdot p_X / p_b \rceil}{s \cdot p_X / p_b}$$

From this equation, it is clear that where $s \cdot p_X$ is a multiple of $p_b$, there is no
I-cache speedup.
The  maximum  I-cache  speedup  is  1  +  s^[- 

3. A cachable sequence of overlappable instructions using disparate MCSs may or may not execute
faster from cache.

The I-cache speedup depends on the instruction latencies, as well as on the order in which they
are executed. As a simplest example of this phenomenon, consider the execution of a pair of
overlappable instructions, Ω and Φ, that use different MCSs. If one instruction's latency is
significantly greater than the other's, then the longer instruction determines the time taken
to execute the pair. A good compiler would schedule the longer operation first, so there would
be no I-cache speedup. However, if Ω and Φ are of roughly similar duration, say within one
broadcast instruction interval of one another, then there may be some I-cache speedup. Assume,
for example, that both Ω and Φ have latency equal to one broadcast instruction interval. Then
the generic SIMD computer uses exactly two broadcast instructions to execute Ω followed by
Φ. When executed from cache, the instruction starting Φ can be issued as early as possible
following the one starting Ω, and the time between those instructions may be less than that
between globally broadcast instructions. Quantization causes a one-iteration pass through a
sequence containing only these two instructions to be no faster from I-cache than from the
broadcast network. However, an iterated loop containing just those two instructions may
execute faster from I-cache, as might a single pass through a longer sequence containing these
two instructions as a subsequence.
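The quantization effect in the second observation can be illustrated numerically. The following is a minimal sketch, not part of the original evaluation tooling. It assumes that p_b is the broadcast interval and p_x the clock period of MCS X, both counted in the same (hypothetical) fine-grained clock ticks, and that a cached sequence is quantized only at its end. The function names are illustrative.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def generic_cycles(n_ops, s, p_x, p_b):
    """System clock cycles for n_ops operations of s MCS cycles each when
    every operation must wait for the next broadcast instruction."""
    # Each operation occupies a whole number of broadcast intervals.
    return n_ops * ceil_div(s * p_x, p_b)

def cached_cycles(n_ops, s, p_x, p_b):
    """System clock cycles for the same sequence issued from I-cache at the
    MCS clock rate (assumption: only the end of the sequence is quantized)."""
    return ceil_div(n_ops * s * p_x, p_b)

def icache_speedup(n_ops, s, p_x, p_b):
    """Ratio of generic to cached system clock cycle counts."""
    return generic_cycles(n_ops, s, p_x, p_b) / cached_cycles(n_ops, s, p_x, p_b)

if __name__ == "__main__":
    # Linear-array shifts at twice the broadcast rate (p_b = 2, p_x = 1).
    print(icache_speedup(8, 1, 1, 2))
    # Latency that is an exact multiple of the broadcast interval.
    print(icache_speedup(8, 2, 1, 2))
```

With N = 8 single-cycle shifts and p_b = 2, the generic count is 8 and the cached count is 4; when s * p_x is a multiple of p_b the two counts coincide.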


Appendix  C 

Illustrated  Example  of  Speedup 
Measurement 


I-cache speedup is obtained by comparing throughput of computations on SIMD computer variants.
A generic SIMD computation is simulated and its output data examined to verify the correctness of
the result. A computation with I-cache is then also simulated and verified. The ratios of system clock
cycle counts yield the I-cache speedup. Another interesting statistic produced by the simulations is
the cache size required to attain the reported speedup. Measurements are taken for varying problems,
varying underlying PE chip designs, and varying system network characteristics, to yield a picture of
the tradeoff stakes in portions of the large design space. This appendix presents a detailed example
of how the basis computer is used to describe a computation and measure its I-cache speedup.


C.1 Operationally Structured Computation

This subsection discusses aspects of creating the assembly language program describing the generic
SIMD computation that is the starting point for I-cache evaluations.


High-Level  Algorithm 

The derivation of a generic SIMD computation begins with a high-level algorithm solving a scalable
data-parallel problem. A data-parallel problem is defined as a set of roughly independent
sub-problems such that the sub-problems can be solved concurrently. A high-level algorithm solving a
data-parallel problem defines a sequence of operations to be performed by each PE that yields the
answer to each sub-problem.

As  an  example,  consider  the  following  high-level  algorithm  for  multiplying  two  square  matrices 
of  dimension  P: 


$$\forall\, row = 0 \ldots P-1,\ \forall\, col = 0 \ldots P-1:\quad C_{row,col} = \sum_{acol=0}^{P-1} A_{row,acol} \cdot B_{acol,col} \tag{C.1}$$


The solution of each independent sub-problem represented by each element of the result matrix
C is given by an accumulated sum of products. The algorithm in Equation C.1 specifies the sequence
of mathematical operations required to solve the sub-problem represented by each element of the
result matrix.
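As an illustration only, Equation C.1 can be written directly as a sequential reference routine. This is a minimal sketch; the function name and the list-of-lists matrix representation are assumptions made for the example and do not appear in the original.

```python
def matmul_high_level(A, B):
    """Reference square matrix multiply following Equation C.1.

    A and B are P x P matrices stored as lists of rows.
    """
    P = len(A)
    C = [[0] * P for _ in range(P)]
    for row in range(P):
        for col in range(P):
            # Each result element is an independent sub-problem:
            # an accumulated sum of products over acol.
            C[row][col] = sum(A[row][acol] * B[acol][col] for acol in range(P))
    return C

if __name__ == "__main__":
    print(matmul_high_level([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Every (row, col) pair is computed without reference to any other result element, which is the independence property that makes the problem data-parallel.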
Figure C.1: Square Matrix Layout on a P-Element Linear Array


Topology-Specific  Algorithm 

The  next  step  in  the  derivation  of  the  generic  SIMD  computation  transforms  the  high-level  algorithm 
into  a  slightly  less  high-level  description  of  its  implementation  on  a  computer  having  a  specific  inter- 
PE  communication  topology.  This  step  involves  identifying  an  inter-PE  communication  network 
topology  and  assigning  the  sub-problems  to  the  PEs.  Together,  the  high-level  algorithm,  the  inter-PE 
communication  topology,  and  the  mapping  of  sub-problems  to  PEs  define  the  set  of  mathematical 
operations  performed  within  each  PE  as  well  as  the  inter-PE  communication  necessary  for  the  PE  to 
obtain  operands. 

The  ideal  mapping  for  a  given  high-level  algorithm  on  a  given  inter-PE  communication  topology  is 
not  easy  to  ascertain  in  general.  Automatic  means  for  doing  so  is  the  subject  of  current  research  [84]. 
For  the  purposes  of  experimentation,  it  suffices  to  have  reasonable  mappings  for  the  problem/topology 
combinations  of  subject  computations.  Experimentally  adding  I-cache  to  realistic  computations  does 
not  require  a  general  automatic  solution  to  the  mapping  problem. 

As an example of this mapping step, consider the square matrix multiplication on a linear array. A
P-element linear array might use the mapping depicted in Figure C.1. This simple mapping assigns
each PE a column of each matrix A, B, and C, such that PE index $\pi$ contains the sets of matrix
elements $\{A[i,\pi] \mid i = 0 \ldots P-1\}$, $\{B[j,\pi] \mid j = 0 \ldots P-1\}$, and
$\{C[i,\pi] \mid i = 0 \ldots P-1\}$.

The high-level algorithm in Equation C.1 and the mapping sketched in Figure C.1 together imply
that each PE index $\pi$ solves the set of sub-problems described in Equation C.2. The required
inter-PE communication is evident in Equation C.2 in the references to $A_{row,acol}$: for
$acol \neq \pi$, PE index $\pi$ needs to access $A_{row,acol}$, which is mapped to PE index $acol$.


$$\forall\, row = 0 \ldots P-1:\quad C_{row,\pi} = \sum_{acol=0}^{P-1} A_{row,acol} \cdot B_{acol,\pi} \tag{C.2}$$


The communication-explicit algorithm at this stage of the derivation is represented as a sequential
program, for example that shown in Figure C.2. Explicit at this stage of the derivation are the set of
variables resident in each PE, the set of operations applied within each PE to its resident variables,
and the inter-PE communication required for operand access.
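The column mapping and the per-PE sum of Equation C.2 can be sketched as a sequential simulation. This is an illustrative sketch only: the function name and the per-PE column lists are assumptions, and the parallel loop over PEs is executed one PE at a time.

```python
def linear_array_matmul(A, B):
    """Simulate the column-per-PE mapping on a P-element linear array.

    PE pi holds column pi of A, B, and C. Any read of a_col[src] with
    src != pi stands for inter-PE communication.
    """
    P = len(A)
    a_col = [[A[i][pi] for i in range(P)] for pi in range(P)]
    b_col = [[B[j][pi] for j in range(P)] for pi in range(P)]
    c_col = [[0] * P for _ in range(P)]
    for i in range(P):
        for pi in range(P):  # conceptually "forall pi in parallel"
            # First term uses the PE's own element of A.
            c_col[pi][i] = a_col[pi][i] * b_col[pi][pi]
            for k in range(1, P):
                src = (pi + k) % P
                # a_col[src][i] is A[i][src], resident in PE src.
                c_col[pi][i] += a_col[src][i] * b_col[pi][src]
    # Reassemble the row-major result from the per-PE columns.
    return [[c_col[pi][i] for pi in range(P)] for i in range(P)]

if __name__ == "__main__":
    print(linear_array_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Starting each PE's sum at its own column index means that, at step k, every PE needs the A element held k positions away, so the accesses can be served by uniform shifts along the array.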


linear_array_square_matrix_multiply (A, B, C, P)
int P;
element A[P][P], B[P][P], C[P][P];
{
    int i, k;
    PE index π;

    for (i = 0 upto P)
    {
        forall (π ∈ {0 upto P}) in parallel
        {
            C[i,π] = A[i,π] * B[π,π];
            for (k = 1 upto P)
                C[i,π] += A[i, π + k (mod P)] * B[π + k (mod P), π];
        }
    }
}


Figure C.2: Linear Array Matrix Multiply

Assembly Language Program

Given the preceding algorithm with explicit inter-PE communication, the following is a list of some
of the transformations applied to yield the assembly language program describing an operationally
structured generic SIMD computation:

•  Conditional  control  structures  are  transformed  into  PE  context  management  operations. 

•  Loops  are  transformed  into  subroutine  calls  with  explicit  iteration  counts. 

•  Loop-index-dependent  expressions  from  the  high-level  language  description  are  partitioned  into 
two  sets:  some  loop-index-dependent  expressions  are  evaluated  by  the  system  controller  and 
broadcast  to  the  PEs,  while  the  rest  are  evaluated  in  the  PEs.  Although  evaluation  of  these 
expressions  in  the  PEs  is  redundant  (since  a  loop-index-dependent  expression  has  the  same 
value  in  all  PEs),  it  is  sometimes  more  efficient  to  evaluate  these  expressions  on  PEs  than  it  is 
on  the  system  controller. 

•  Variables are assigned locations in PE registers and local external memory, and variable
references are converted into calculations of addresses followed by local external memory accesses.

As an example, Figure C.3 contains an assembly language program for a generic SIMD computation
for square matrix multiplication on a linear array. The parameters &A_pi, &B_pi, and &C_pi are
the addresses of matrix column buffers in local external memory.
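As a reading aid for the program in Figure C.3, the register-level behavior of its OUTER and INNER loops can be modeled as follows. This is a hedged sketch, not a simulator for the assembly language: it assumes that R0 holds the PE index, that LDN0 fetches R4 from the neighboring PE with the next higher index (with wraparound), and that the LC_PUSH_EQ / PASS / [POP] sequence wraps the B-buffer pointer. The function name and data layout are hypothetical.

```python
def simulate_lsmm(A, B):
    """Model the per-PE register traffic of the linear array matrix multiply."""
    P = len(A)
    # Local external memory: PE pe holds column pe of A, B, and C.
    a_mem = [[A[i][pe] for i in range(P)] for pe in range(P)]
    b_mem = [[B[j][pe] for j in range(P)] for pe in range(P)]
    c_mem = [[0] * P for _ in range(P)]
    for i in range(P):  # OUTER: one result row per pass
        r4 = [a_mem[pe][i] for pe in range(P)]       # R4 = LOAD(R1)
        r2 = list(range(P))                          # B pointer offset starts at the PE index
        r5 = [b_mem[pe][r2[pe]] for pe in range(P)]  # R5 = LOAD(R2)
        r2 = [(p + 1) % P for p in r2]               # increment with wraparound
        r6 = [r4[pe] * r5[pe] for pe in range(P)]    # R6 = MULT(R4, R5)
        for _ in range(P - 1):  # INNER
            # Assumed shift direction: each PE takes R4 from PE (pe + 1) mod P.
            r4 = [r4[(pe + 1) % P] for pe in range(P)]
            r5 = [b_mem[pe][r2[pe]] for pe in range(P)]
            r2 = [(p + 1) % P for p in r2]
            r6 = [r6[pe] + r4[pe] * r5[pe] for pe in range(P)]
        for pe in range(P):
            c_mem[pe][i] = r6[pe]                    # STORE(R6, R3)
    return [[c_mem[pe][i] for pe in range(P)] for i in range(P)]

if __name__ == "__main__":
    print(simulate_lsmm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The inner loop contains one neighbor shift, one local load, and one multiply-accumulate per iteration, which is the repeated instruction sequence that later becomes the cached routine.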

C.2  Summary  of  Generic  SIMD  Computer  Parameters 

The following parameters are used in deriving physical descriptions of SIMD computation from
operationally structured descriptions:


program lsmm_ass (&A_pi, &B_pi, &C_pi, P) :
    LDX  IR0 '0'             ; R2 = LITERAL('&A_pi - 1')
    LDX  IR2 '-1'            ;
    CJSR FORC INPUT 'P-1'    ;
    LDX  IR0 '0'             ; R2 = LITERAL('&B_pi - 1')
    LDX  IR2 'P-1'           ;
    CJSR FORC INPUT 'P-1'    ;
    LDX  IR0 '0'             ; R1 = LITERAL('&A_pi')
    LDX  IR0 '0'             ; R7 = LITERAL('&B_pi')
    LDX  IR0 '0'             ; R8 = LITERAL('&B_pi + P')
    LDX  IR0 '0'             ; R3 = LITERAL('&C_pi')
    CJSR FORC OUTER 'P-1'    ;
    LDX  IR0 '0'             ; R2 = LITERAL('&C_pi - 1')
    LDX  IR2 '2 * P - 1'     ;
    CJSR FORC OUTPUT 'P-1'   ;
    HALT                     ;
INPUT:
    SPX  IR2 '1'             ; R1 = IO_LD('0')
                             ; R2 = ADD('1',R2)
    LTST ICTO INPUT          ; STORE(R1, R2)
OUTER:
                             ; R4 = LOAD(R1)
                             ; R1 = ADD('1', R1)
                             ; R2 = ADD(R7, R0)
                             ; R5 = LOAD(R2)
                             ; R2 = ADD('1',R2)
                             ; LC_PUSH_EQ(R2,R8)
                             ; R2 = PASS(R7)
    CJSR FORC INNER 'P-2'    ; [POP] R6 = MULT(R4, R5)
                             ; STORE(R6,R3)
    LTST ICTO OUTER          ; R3 = ADD('1', R3)
INNER:
                             ; R4 = LDN0(R4)
                             ; R5 = LOAD(R2)
                             ; R2 = ADD('1',R2)
                             ; LC_PUSH_EQ(R2, R8)
                             ; R2 = PASS(R7)
                             ; [POP] R9 = MULT(R4, R5)
    LTST ICTO INNER          ; R6 = ADD(R6, R9)
OUTPUT:
                             ; R2 = ADD('1',R2)
                             ; R1 = LOAD(R2)
    SPX  IR2 '1'             ; IO_ST(R1,'0')
    LTST ICTO OUTPUT         ;


Figure  C.3:  Assembly  Language  Program  for  Linear  Array  Square  Matrix  Multiply 


1. Number of PEs in the computer {P}

2. Number of PE building blocks in the computer {M}

3. Number of PEs per PE chip {N}

4. Per-PE local external memory size {LEMSize}

5. PE register file size {PERFSize}

6. The set of possible inter-PE communication operations. This set is determined by the topology
of the computer's inter-PE communication network.

7. FU and MCS operation stepcounts, determined by such factors as

•  PE datapath width (in bits) relative to problem data word-width,

•  FU circuit complexity (for example, multipliers and barrel shifters require more complex
circuits than do adders and distance-1 shifters), and

•  PE chip pin time-sharing for MCSs.

C.3  Physically  Structured  Generic  SIMD  Computation 

At this final stage of the derivation, the latencies of all FU and MCS operations are explicit, and
high-latency operations are explicitly overlapped where possible with other operations that are
flow-independent.

Figure  C.4  shows  a  program  produced  from  the  program  in  Figure  C.3  that  provides  a  1024-PE 
SIMD-D  throughput  baseline. 

C.4 Derivation of Multi-Clock Computation

Multi-clock SIMD computation is I-cached SIMD computation without the I-cache itself. The local
controller of a multi-clock SIMD computer contains the multi-clock generator that supplies the MCSs
with clocks, each at its highest rate. The throughput on a multi-clock SIMD computer is compared
with the throughput baseline to indicate how much speedup is attained from multi-clocking alone.

The assembly language program describing an operationally structured generic SIMD computation
also describes an operationally structured multi-clock SIMD computation. The difference
between the corresponding physically structured computations is that the p-set describing the
relative MCS clock rates is an additional set of parameters in the translation to the physically
structured multi-clock computation.

C.5  Derivation  of  I-Cached  Computations 

An  operationally  structured  I-cached  computation  is  obtained  by  changing  the  assembly  language 
program  for  the  generic  SIMD  computation.  The  changes  involve  re-ordering  some  of  the  instructions 
and  adding  cache-control  instructions  as  required  to  store  and  subsequently  activate  routines  in 
cache. 

The  physically  structured  I-cached  computation  is  obtained  by  assembling  and  scheduling  the 
program  using  operation  stepcounts,  a  p-set,  and  cache  size  as  parameters. 


Operationally Structured F0 Computation

Figure C.5 shows a program for the matrix multiplication computation using an F0 cache.

Operationally Structured F2 Computation

Figure C.6 shows a program for the matrix multiplication computation using an F2 cache.


C.6  Measured  Throughput  and  Speedup 

The assembly language programs are recompiled, simulated, and verified. Measurements are shown in
the following figures for each of the four SIMD computer variants SIMD-A, SIMD-B, SIMD-C, and
SIMD-D. Measurements are taken for $p_b$ ranging from 1 to 16 using the p-sets {N, N, N, N, N} and
{N, 1, 1, 1, 1}. The cache sizes used to obtain the measured I-cache speedups in the four SIMD
computer variants were 1196...1204, 307...312, 48...62, and 10...22, respectively.


program lsmm_1ce_ass (&A_pi, &B_pi, &C_pi, P) :
    LDX  IR0 '0'             ; R2 = LITERAL('&A_pi - 1')
    LDX  IR2 '-1'            ;
    CJSR FORC INPUT 'P-1'    ;
    LDX  IR0 '0'             ; R2 = LITERAL('&B_pi - 1')
    LDX  IR2 'P-1'           ;
    CJSR FORC INPUT 'P-1'    ;
CACHE1:
                             ; CC_STO
                             ; R4 = LDN0(R4)
                             ; R5 = LOAD(R2)
                             ; R2 = ADD('1',R2)
                             ; LC_PUSH_EQ(R2, R8)
                             ; R2 = PASS(R7)
                             ; [POP] R9 = MULT(R4, R5)
                             ; R6 = ADD(R6, R9)
                             ; CC_STO
    LDX  IR0 '0'             ; R1 = LITERAL('&A_pi')
    LDX  IR0 '0'             ; R7 = LITERAL('&B_pi')
    LDX  IR0 '0'             ; R8 = LITERAL('&B_pi + P')
    LDX  IR0 '0'             ; R3 = LITERAL('&C_pi')
    CJSR FORC OUTER 'P-1'    ;
    LDX  IR0 '0'             ; R2 = LITERAL('&C_pi - 1')
    LDX  IR2 '2 * P - 1'     ;
    CJSR FORC OUTPUT 'P-1'   ;
    HALT                     ;
INPUT:
    SPX  IR2 '1'             ; R1 = IO_LD('0')
                             ; R2 = ADD('1',R2)
    LTST ICTO INPUT          ; STORE(R1, R2)
OUTER:
                             ; R4 = LOAD(R1)
                             ; R1 = ADD('1', R1)
                             ; R2 = ADD(R7, R0)
                             ; R5 = LOAD(R2)
                             ; R2 = ADD('1',R2)
                             ; LC_PUSH_EQ(R2,R8)
                             ; R2 = PASS(R7)
    CJSR FORC INNER 'P-2'    ; [POP] R6 = MULT(R4, R5)
                             ; STORE(R6,R3)
    LTST ICTO OUTER          ; R3 = ADD('1', R3)
INNER:
    FORK CACHE1              ;
    LTST ICTO INNER          ;
OUTPUT:
                             ; R2 = ADD('1',R2)
                             ; R1 = LOAD(R2)
    SPX  IR2 '1'             ; IO_ST(R1,'0')
    LTST ICTO OUTPUT         ;


Figure C.5: Assembly Language Program for F0 Linear Array Square Matrix Multiply
program lsmm_2ce_ass (&A_pi, &B_pi, &C_pi, P) :
    LDX  IR0 '0'             ; R2 = LITERAL('&A_pi - 1')
    LDX  IR2 '-1'            ;
    CJSR FORC INPUT 'P-1'    ;
    LDX  IR0 '0'             ; R2 = LITERAL('&B_pi - 1')
    LDX  IR2 'P-1'           ;
    CJSR FORC INPUT 'P-1'    ;
CACHE1:
                             ; CC_STO
                             ; R4 = LDN0(R4)
                             ; R5 = LOAD(R2)
                             ; R2 = ADD('1',R2)
                             ; LC_PUSH_EQ(R2, R8)
                             ; R2 = PASS(R7)
                             ; [POP] R9 = MULT(R4, R5)
                             ; R6 = ADD(R6, R9)
                             ; CC_STO
    LDX  IR0 '0'             ; R1 = LITERAL('&A_pi')
    LDX  IR0 '0'             ; R7 = LITERAL('&B_pi')
    LDX  IR0 '0'             ; R8 = LITERAL('&B_pi + P')
    LDX  IR0 '0'             ; R3 = LITERAL('&C_pi')
    CJSR FORC OUTER 'P-1'    ;
    LDX  IR0 '0'             ; R2 = LITERAL('&C_pi - 1')
    LDX  IR2 '2 * P - 1'     ;
    CJSR FORC OUTPUT 'P-1'   ;
    HALT                     ;
INPUT:
    SPX  IR2 '1'             ; R1 = IO_LD('0')
                             ; R2 = ADD('1',R2)
    LTST ICTO INPUT          ; STORE(R1, R2)
OUTER:
                             ; R4 = LOAD(R1)
                             ; R1 = ADD('1', R1)
                             ; R2 = ADD(R7, R0)
                             ; R5 = LOAD(R2)
                             ; R2 = ADD('1',R2)
                             ; LC_PUSH_EQ(R2,R8)
                             ; R2 = PASS(R7)
                             ; [POP] R6 = MULT(R4, R5)
    FORK CACHE1 'P-2'        ;
                             ; STORE(R6,R3)
    LTST ICTO OUTER          ; R3 = ADD('1', R3)
OUTPUT:
                             ; R2 = ADD('1',R2)
                             ; R1 = LOAD(R2)
    SPX  IR2 '1'             ; IO_ST(R1,'0')
    LTST ICTO OUTPUT         ;


Figure C.6: Assembly Language Program for F2 Linear Array Square Matrix Multiply


[Figures: Speedup Relative to Generic SIMD Computer]
Appendix D

The Sample Programs

This  appendix  shows  the  assembly  language  programs  used  in  the  I-cache  evaluations. 


program tree_sum_1 (logP) :

! tree summation. logP is log(P), where P is the number of PEs.
! loop index is in R5.
!
! Load the datum A[i] from system data memory into R1:
    LDX  IR0 '0'                 ; R1 = IO_LD('0')
                                 ; R5 = PASS('-1')
    CJSR FORC SUMUP 'logP - 1'   ; R3 = PASS(R0)
! wake all the PEs up again
                                 ; [CLR]
! PE 0 provides the answer, mask all other PE's answers
                                 ; LC_PUSH_NE('0',R0)
                                 ; R1 = PASS('0')
! Store the result from R1 into system data memory:
    LDX  IR0 '0'                 ; IO_ST(R1,'1')
    HALT                         ;
SUMUP:
                                 ; R2 = PASS(R1)
                                 ; R6 = AND('1',R3)
                                 ; LC_PUSH_EQ('0',R6)
                                 ; R4 = ADD('1',R3)
                                 ; [INV] R4 = ADD('-1',R3)
                                 ; [POP] R5 = ADD('1',R5)
                                 ; R4 = LSHIFT(R4,R5)
                                 ; LC_PUSH_EQ('0',R6)
                                 ; R2 = ROUTE(R2,R4)
                                 ; [INV] R4 = PASS('-1')
                                 ; [INV] R1 = ADD(R1,R2)
                                 ; R6 = PASS('1')
    LTST ICTO SUMUP              ; R3 = RSHIFT(R3, R6)


Figure D.1: Assembly Language Program for tree
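For orientation, the following is a generic tree summation over P values in log(P) combining steps. It is a sketch of the general technique only, not a transcription of the program in Figure D.1; the pairing of PEs by stride is an assumption chosen for simplicity, and the function name is hypothetical.

```python
def tree_sum(values):
    """Sum P values in log2(P) combining steps; P must be a power of two."""
    vals = list(values)
    P = len(vals)
    if P == 0 or P & (P - 1) != 0:
        raise ValueError("number of PEs must be a positive power of two")
    stride = 1
    while stride < P:
        # In one step, every PE whose index is a multiple of 2*stride
        # adds the partial sum held by the PE 'stride' positions away.
        for pe in range(0, P, 2 * stride):
            vals[pe] += vals[pe + stride]
        stride *= 2
    # PE 0 provides the answer.
    return vals[0]

if __name__ == "__main__":
    print(tree_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

Only half of the remaining active PEs do useful work in each successive step, which is why the assembly version manipulates PE context so heavily.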
program excl_plus_scan (logP) :

! exclusive plus scan.

! The argument logP is log(P), where P is the number of PEs.

! The local buffer of "L" values starts at LEM[0], while
! the local buffer of indexes starts at LEM[logP]

!  Mo  awaalng  axound,  ao  ladax  xaglatar  ayataai  only  uaad  if  nacaaaaxy 

I 

!  Load  tba  datw  A[pl]  fzca  ayatan  data  aanoxy  into  Rl: 

LDX  UtO  '0'  ;  Rl  >  ZOJU>('0') 

;  R2  >  PRSS(R0) 

;  RB  -  PR8S('-1') 

;  R3  >  PAS8('0') 

;  R5  >  PAS8('0') 

;  R4  -  PASS('l') 

CJSR  rone  SHBPOP  'lo^  -  l'  ; 

;  Rl  at  PASS('O') 

;  R8  -I  JU>D('1',R8) 

;  [PRC]  R9  a  PRSS('O') 

CJSR  FQBC  SNKEPDM  'logP  -  1' 

!  Stoza  tha  zaault  fzon  Rl  into  ayatan  data  mamozy; 

US  ZRO  '0'  ;  IO^(Rl, '1') 

BRLT  ; 

SNEEPOP: 

;  RIO  a  MD('1',R2) 

;  R8  a  AD0('1',R8) 

;  ZjClPnSB^('0',R10) 

;  R7  a  AD0('1',R2) 

;  R7  a  Z,8Bm(R7,R8) 

;  R7  a  0R(R7,R5) 

;  IXmn  R7  a  PRSS('-I') 

;  R5  a  LSBIFT(R5,R4) 

;  RS  a  QR('1',R5) 

;  [POPl 

;  LCJPOSHJa!('0',R10) 

;  R«  a  ROOTS  (R1,R7) 

;  [IMV]  R7  a  PASS('-l') 

;  [IMP]  Rl  a  RDD(R1,R6) 

;  STORE  (R6,R3) 

;  R3  a  JU)0('1',R3) 

LTST  zero  SNEEPOP  ;  R2  a  RSEZPT(R2,R4) 

SNEEPDM: 

;  RS  a  RSHZrT(RS,R4) 

;  R7  a  Z.SHZrT(R2,R8) 

;  R8  a  ]U>0('-1',R8) 

;  R7  a  0R(R7,R5) 

;  R2  a  LSHZFT(B2,R4) 

;  R2  a  0R(R2,R4) 

;  R9  a  PRSS('l') 

;  [POP] 

;  LCJOSHJEQC'O' ,R9) 

;  Rl  a  R00TE(R1,R7) 

;  [ZMVJ  R3  a  JU}0('-1',R3) 

;  R6  a  L0AD(R3) 

;  Rl  a  RDD(R6,R1) 

LTST  zero  SNEEPDM  ;  [POP] 


Figure D.2: Assembly Language Program for scan
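The two subroutines in Figure D.2 suggest an up-sweep followed by a down-sweep. The following sketch shows the standard work-efficient exclusive plus scan with that structure. It is an assumption that the assembly program follows this scheme; the code is a sequential illustration of the technique and the function name is hypothetical.

```python
def exclusive_plus_scan(values):
    """Exclusive prefix sum using an up-sweep and a down-sweep.

    The input length must be a power of two.
    """
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a positive power of two")
    # Up-sweep: build partial sums in a balanced tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    # Clear the root, then push prefixes back down the tree.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            left = a[i - d]
            a[i - d] = a[i]
            a[i] += left
        d //= 2
    return a

if __name__ == "__main__":
    print(exclusive_plus_scan([1, 2, 3, 4]))
```

Each sweep takes log(P) steps, matching the 'logP - 1' iteration counts passed to the two subroutine calls.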


program bubble_sort (P) :

! Sorting P values on a P-element linear array

! Perform local swaps until nobody detects any out of order values.

! 

LDX  XRO  '0'  ;  Rl  >  XOJi)('0') 

;  R3  -  PASS('O') 

;  L(1PUSRXQ('0',RO) 

;  R3  s  PASS('l') 

LDX  XRO  '0'  ;  [POP]  R7  -  LXXERAL( 'P-1' ) 

;  LCJ?OSH-BQ(R7,RO) 

;  R3  >  PASS('l') 

;  [POP]  R5  >RHD('1',R0) 

;  LCJPDSRJBQC'O' ,R5) 

;  R4  ■  PAS8('l') 

;  [XMV]  R4  ■PAS8{'0') 

! jump to the sorting step inner loop --

! 2019 is the measured number of STEP iterations for a 4K-element
! problem instance; providing this number in the CJSR lets the compiler
! figure out how long the computation runs.

CJSR  rose  STEP  '2018'  ;  [POP] 

LDX  XRO  '0'  ;  XOj3T(Rl, '0' ) 

HALT  ; 

STEP : 

;  R6  V  PASS('O') 

!  even  phase : 

;  LCJPOSa^('l',R4) 

;  R2  a  U>N0(R1) 

;  [XHV]  R2  a  LUP0(K1) 

;  [PCS]  LCJPDSB-EQ('1',R4) 

;  LCJPnSHXT(R2,Rl) 

;  Rl  a  PASS(R2) 

;  R6  a  PASS('l') 

;  [POP] 

;  [XltV]  LCLP0SEJ.T(R1,R2) 

;  Rl  a  PASS(R2) 

;  [POP] 

!  odd  phase : 

;  [POP]  LaPnSBJEQ('0',R4) 

;  R2  a  U»0(R1) 

;  [XHV]  R2  a  LDP0(R1) 

;  [POP]  LCLPOSH^('0' ,R3) 

;  LC^USBJEQC'l' ,R4) 

;  LC^USH-LT(R1,R2) 

;  Rl  a  PASS(R2) 

;  R6  a  PASS('l') 

;  [POP] 

;  [XHV]  LCJUSHa.T(R2,Rl) 

;  Rl  a  PASS(R2) 

;  [POT] 

;  [POT] 

;  [POP] 

LTST  RSPO  STEP  ;  RESPOND (R6) 


Figure D.3: Assembly Language Program for bubble
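The even and odd phases of the STEP loop correspond to odd-even transposition sort, repeated until no PE reports a swap. The sketch below shows that technique sequentially. It is illustrative only; the function name and the returned step count are assumptions introduced for the example.

```python
def odd_even_transposition_sort(values):
    """Sort by alternating even and odd neighbor exchanges.

    Returns the sorted list and the number of STEP iterations executed,
    where one iteration is an even phase followed by an odd phase.
    """
    a = list(values)
    n = len(a)
    steps = 0
    swapped = True
    while swapped:
        swapped = False
        for start in (0, 1):  # even phase, then odd phase
            for i in range(start, n - 1, 2):
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
                    # Models the global response that keeps the loop running.
                    swapped = True
        steps += 1
    return a, steps

if __name__ == "__main__":
    print(odd_even_transposition_sort([5, 1, 4, 2, 3]))
```

The loop ends after the first iteration in which neither phase performs a swap, so the iteration count depends on the data, which is why the listing supplies a measured count to the compiler.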


program rowcol (sqrtP, logP) :

! Sorting P values on a P-element square mesh using Leighton's
! alternating row and column bubble sort.

! 

!  Input  the  array  to  be  sorted: 

U)X  IRO  '0'  ;  R1  =  10_IJD('0') 

!  Set  up  the  sorting  loops 
;  R4  =  PASS('O') 

U>X  IRO  '0'  ;  R8  =  LITERAL ( 'sqrtP' ) 

;  R9  -  H0(D(R0,R8) 

;  LCJ>USH-LT(R0,R8) 

;  R4  s  PASS('-l') 

LDX  IRO  '0'  ;  [POP]  RIO  =  LITERAL ( 'sqrtP  *  (sqrtP  -  1) ' ) 
;  LC-PnSH.GE(R0,R10) 

;  R4  =  PASS('-l') 

;  [POP]  R3  s  PASS('O') 

;  LC-PnSH-EQ('0' ,R9) 

;  R3  =  PASS('-l') 

;  [POP]  RIO  *  ADD('-1',R8) 

;  LC-PnSH.£Q(R9,R10) 

;  R3  =  PASS <'-!') 

;  [POP]  Rll  =  DIV<R0,R8) 

;  Rll  «  AMD('l' ,R11) 

;  R14  =  AilD('l'  ,R9) 

;  LC-PUSH_EQ('0' ,R11) 

;  R5  =  PASS('-l') 

;  LC.PUSBJBQ('0' ,R14) 

;  R6  =  PASS('-l') 

;  [INV]  R6  -  PASS('O') 

;  [POP] 

;  [mv]  R5  *  PASS('O') 

;  LCLPUSHJBQ('0' ,R14) 

;  R6  =  PASS('O') 

;  [INV]  R6  *  PASS('-l') 

;  [POP] 

!  Do  the  sort: 

CJSR  FORC  SORT^TEP  'logP  -  1'  ;  [POP] 

!  Set  up  the  reversal  for  odd- indexed  rows  (works  for  sqrtP  >  2} 
;  LC.FUSHJQ('0' ,R5) 

;  R12  =  ADD('-1' ,R8) 

;  R12  =  SUB(R12,R9} 

;  R13  =  ADD('l' ,R9) 

;  LC^USH.GE(R13,R8) 

;  R13  s  STm(R13,R8) 

;  [POP]  R2  =  SDNO(Rl) 

;  LC^nsa.£Q(R13,R12) 

CJSR  FORC  REVERSE^TEP  'sqrtP  /  2  -  2' ;  R1  =  PASS(R2) 

;  [POP] 

;  [POP] 




!  Output  the  sorted  array: 

IDX  IRO  '0'  ;  10^T(R1, '0') 

HiaT  ; 

SORT_STEP:

! Simulation measures 217 iterations for 4K data set
! But gets counted logP=12 times => 18 iters (18.1)
    CJSR FORC SORT_ROW '17' ;

! Simulation measures 58 iterations for 4K data set
! But gets counted logP=12 times => 5 iters (4.9)
    CJSR FORC SORT_COL '4' ;

    LTST ICTO SORT_STEP ;

SORT_ROW:

;  R7  =  PRSS('O') 

!  even  phase : 

;  LCJPnSH.EQ(R5,R6) 

;  R2  »  SDNO(Rl) 

;  [XMV]  R2  s  SDPO(Rl) 

;  [POP]  LCJraSR-MX  (  ' 0 '  ,  R6) 

;  LC.PUSB-LT(R2,R1) 

;  R1  «  PASS(R2) 

;  R7  =  PASS('-l') 

;  [POP] 

;  [IMV]  LC.PTTSBj6T(R2,R1) 

;  R1  s  PASS(R2) 

;  [POP] 

!  odd  phase : 

;  [POP]  LCLPUSHJIB(R5,R6) 

;  R2  -  SDNO(Rl) 

;  [INV]  R2  =  SUPO(Rl) 

;  [POP]  LC.PUSH.EQ('0',R3) 

;  1.C.PUSH.EQ('0' ,R6) 

;  LC.PUSH-LT(R2,R1) 

;  R1  s  PASS(R2) 

;  R7  =  PASS('-l') 

;  [POP] 

;  [INV]  I.C.PUSHj8T(R2,R1) 

;  R1  s  PASS(R2) 

;  [POP] 

;  [POP] 

;  [POP] 

LTST  RSPO  SORT.ROW  ;  RESPOND  (R7) 


SORT_COL:

;  R7  »  PASS('O') 

!  even  phase: 

;  LC-PUSHJHB('0' ,R5) 

;  R2  »  SD1I1<R1) 

;  [INV]  R2  »  SnPl(Rl) 

;  [IMV]  LC-PUSB-LT(R2,R1) 

;  R1  s  PASS(R2) 

;  R7  »  PASS('-l') 

;  [POP] 

;  [IMV]  LCfnSHjST(R2,Rl) 

;  R1  «  PASS(R2) 

;  [POP] 

!  odd  phase: 

;  [POP]  LC-PUSHJBQ('0',R5) 

;  R2  »  SDNl(Rl) 

;  [IMV]  R2  e  SUPl(Rl) 

;  [POP]  LC-PUSB^('0',R4) 

;  LOPUSHJEQC'O' ,R5) 

;  l£JPnSHJC<T(R2,Rl) 

;  R1  »  PASS(R2) 

;  R7  =  PASS ('-!') 

;  [POP] 

;  [IMV]  LCJPnSHJST(R2,Rl) 

;  R1  s  PASS(R2) 

;  [POP] 

;  [POP] 

;  [POP] 

X.TST  RSPO  SORTjCOL  ;  RESPOND  (R7) 
REVERSE^TEP: 

;  [POP]  R2  =  SDM0(R2) 

;  R13  =  ADD('2',R13) 

;  LC-PnSHjGE(R13,R8) 

;  R13  =  SUB(R13,R8) 

;  [POP]  R2  =  SDM0(R2) 

;  LC-PnSH-EQ(R13,R12) 

LTST  ICTO  REVERSE^TEP  ;  R1  =  PhSS(R2) 


Figure  D.4:  Assembly  Language  Program  for  rowcol 


program  bitonic-Sort  (logP)  : 

!  Bitonic  sort  on  P  elements. 

) 

!  Load  the  datiaa  A[l]  from  system  data  memory  bank  address  0  into  R6: 
U>X  IRO  '0'  ;  R6  =  IOJU)('0') 

LDX  IRO  '0'  ;  Rll  =  LITERAL ( 'logP' ) 

LDX  IRO  '0'  ;  R12  s  LITERAL('l') 

;  R1  =  LSHIFT('1',R11) 

;  R2  =  PASS(Rl) 

LDX  IRl  '-1'  ; 

CJSR  FORC  OUTER  'logP  1'  ; 

!  Store  the  result  from  R6  back  into  system  data  memory  bank  address  0; 
LDX  IRO  '0'  ;  IOJ5T(R6, '0' ) 

BALT  ; 

OOTER: 

;  R2  s  RSRirT(R2,R12) 

;  R4  -  PASS('O') 

;  R5  s  PASS(R2) 

;  R3  -  PASS(Rl) 

SPX  IRl  '1'  ; 

0 

!  When  the  iteration  count  specified  in  the  CJSR  instruction  is  less 
!  than  0,  the  system  controller  indexer  subsystem  provides  the  loop 
!  iteration  count : 

CJSR  FORC  INNER  '-1'  ; 

LTST  ICTO  OUTER  ; 


INNER:

;  R3  »  RSBirT(R3,R12) 

!  evaluate  first  condition 
;  R13  «  SUB(R1,R5) 

;  l.(LPt7SB_Z.T(R0,R13) 

;  R15  s  »ASS('l') 

;  [ZMV]  R15  »  PASS('O') 

;  [POP]  R14  s  RMD(R0,R2) 

;  LCfUSBLEQ(R14fR4) 

;  R15  a  AMD('1',R15) 

;  [INV]  R15  =  PASS('O') 

!  first  'then'  leg 

;  [POP]  IiCJ*USHJQ('l' ,R15) 

;  R9  «  PASS('l') 

;  RIO  s  PASS('O') 

;  R8  a  ADD(R0,R5} 

!  evaluate  second  condition 

;  [INV]  R13  -  SUB(R0,R5) 

;  Z.C-PnSR^('0' ,R13) 

;  R15  =  PASS('l') 

;  (331V]  R15  =  PASS('O') 

;  [POP]  R14  s:  AND(R13,R2) 

;  LaPnsa^(R14,R4) 

;  R15  *  3kND('l'  ,R15) 

;  [mV]  R15  =  PASS('O') 

!  second  'then'  leg 

;  [POP]  ULPtISHJSQ('l',R15) 

;  R9  «  PASS('0'} 

;  RIO  s  PASS('l') 

;  R8  s  StIB(R0,R5) 

!  last  leg  of  conditional 

;  [mV]  R9  «  PASS('O') 

;  RIO  *  PASS('O') 

;  R8  *  PASS ('-!') 

;  [POP] 

!  pass,  coapare,  conditionally  swap 
;  [POP]  R7  *  ROUTE (R6,R8) 

;  LC^USR-LT(R7,R6) 

;  tC^USH-EQ(’l',R9) 

;  R8  »  PASS(R7) 

;  [POP] 

;  [POP]  I.C-PUSHjGT(R7,R€) 

;  LCJ»USH-EQ('1',R10) 

;  R6  a  PASS(R7) 

;  [POP] 

;  [POP]  R5  «  SUB(R3,R2) 

LTST  ICTO  INNER  ;  R4  =  PASS(R2) 


Figure D.5: Assembly Language Program for bitonic
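For comparison with the listing in Figure D.5, the standard bitonic sorting network can be sketched as below. This is the textbook formulation, in which element i exchanges with partner i XOR j; whether the assembly program computes its partners in exactly this way is not confirmed by the source. The function name is hypothetical.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(values)
    n = len(a)
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # distance to the exchange partner
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

if __name__ == "__main__":
    print(bitonic_sort([7, 3, 6, 0, 5, 1, 4, 2]))
```

The nested loops over k and j mirror the OUTER and INNER subroutines of the listing, and the sequence of compare-exchange steps is fixed regardless of the data.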


program lsmm_ass (&A_pi, &B_pi, &C_pi, P) :
    LDX  IR0 '0'             ; R2 = LITERAL('&A_pi - 1')
    LDX  IR2 '-1'            ;
    CJSR FORC INPUT 'P-1'    ;
    LDX  IR0 '0'             ; R2 = LITERAL('&B_pi - 1')
    LDX  IR2 'P-1'           ;
    CJSR FORC INPUT 'P-1'    ;
    LDX  IR0 '0'             ; R1 = LITERAL('&A_pi')
    LDX  IR0 '0'             ; R7 = LITERAL('&B_pi')
    LDX  IR0 '0'             ; R8 = LITERAL('&B_pi + P')
    LDX  IR0 '0'             ; R3 = LITERAL('&C_pi')
    CJSR FORC OUTER 'P-1'    ;
    LDX  IR0 '0'             ; R2 = LITERAL('&C_pi - 1')
    LDX  IR2 '2 * P - 1'     ;
    CJSR FORC OUTPUT 'P-1'   ;
    HALT                     ;
INPUT:
    SPX  IR2 '1'             ; R1 = IO_LD('0')
                             ; R2 = ADD('1',R2)
    LTST ICTO INPUT          ; STORE(R1, R2)
OUTER:
                             ; R4 = LOAD(R1)
                             ; R1 = ADD('1', R1)
                             ; R2 = ADD(R7, R0)
                             ; R5 = LOAD(R2)
                             ; R2 = ADD('1',R2)
                             ; LC_PUSH_EQ(R2,R8)
                             ; R2 = PASS(R7)
    CJSR FORC INNER 'P-2'    ; [POP] R6 = MULT(R4, R5)
                             ; STORE(R6,R3)
    LTST ICTO OUTER          ; R3 = ADD('1', R3)
INNER:
                             ; R4 = LDN0(R4)
                             ; R5 = LOAD(R2)
                             ; R2 = ADD('1',R2)
                             ; LC_PUSH_EQ(R2, R8)
                             ; R2 = PASS(R7)
                             ; [POP] R9 = MULT(R4, R5)
    LTST ICTO INNER          ; R6 = ADD(R6, R9)
OUTPUT:
                             ; R2 = ADD('1',R2)
                             ; R1 = LOAD(R2)
    SPX  IR2 '1'             ; IO_ST(R1,'0')
    LTST ICTO OUTPUT         ;


Figure D.6: Assembly Language Program for matmul


program mesh_sobel (sqrtP) :

!  Load  the  input  pixel: 

LDX  IRO  '0'  ;  R1  *  IOJJ)('0') 

!  Flag  the  edge  pixels: 

LDX  IRO  '0'  ;  R8  *  LITERAL ( 'aqrtP' ) 

;  R7  =  liCX>(R0,R8) 

LDX  IRO  '0'  ;  R2  =  LITERAL ( 'aqrtP  *  (sqrtP  -  1) ' 
;  R3  «  ADD('-1',R8) 

;  L(LPUSH-LT(R0,R8) 

;  R9  »  PASS('-l') 

;  (mV]  R9  =  PASS  (’O') 

;  [POP]  LCLPUSHjSE(R0,R2) 

;  R12  =  PASS('-l') 

;  [IMV]  R12  =  PASS  (’O') 

;  [POP]  LC-PUSH-EQ('0' ,R7) 

;  RIO  =  PASS('-l') 

;  [INV]  RIO  =  PASS (’O') 

;  [POP]  LC-PUSaJEQ(R3,R7) 

;  Rll  *  PASS('-l') 

;  [INV]  Rll  =  PASS  (’O') 

!  Calculate  the  x  and  y  gradients: 

;  [POP]  R2  =  SDNO(Rl) 

;  LC_PUSHJSQ('-1' ,R11) 

;  R2  =  PASS(Rl) 

;  [POP]  R3  =  StJPO(Rl) 

;  LC-PUSHJ:Q('-1',R10) 

;  R3  s  PASS(R1} 

;  [POP]  R13  =  PASS('l') 

;  R4  s  LSHIFT(R1,R13)  !  pix  *  2 
;  R4  s  ADD(R3,R4) 

;  R4  s  ADD(R4,R2) 

;  R6  =  SDN1(R4) 

;  LaPUSHJEQ{'-l' ,R12) 

;  R6  =  PASS(R4) 

;  [POP]  R5  =  SDP1(R4) 

;  LCJPOSH-EQ('-l' ,R9) 

;  R5  s  PASS(R4) 

;  [POP]  R8  s  SUB(R5,R6} 

;  R4  »  SUB(R2,R3} 

;  R€  -  SDN1(R4) 

;  LC-PUSH_EQ(’-1',R12) 

;  R6  s  PASS(R4) 

;  [POP]  R5  =  SUP1(R4) 

;  LCJUSHJ:Q('-1' ,R9) 

;  R5  s  PASS(R4) 

;  [POP]  R7  =  LSBIFT(R4,R13)  !  R4  *  2 
;  R7  =  ADD(R5,R7) 

;  R7  =  ADD(R7,R6) 


! Iteratively calculate the root square gradient value:
; R7 = MULT(R7,R7)
; R8 = MULT(R8,R8)
; R2 = ADD(R7,R8)
; R3 = RSHIFT(R2,R13)
; R4 = MULT(R3,R3)
CJSR FORC SQRT '12' ; R4 = SUB(R4,R2)

! Write the gradient magnitude result
LDX IR0 '0' ; IO_ST(R3,'1')
HALT ;

SQRT:
; LC_PUSH_EQ('0',R4)
; R9 = PASS('0')
; [INV] R5 = LSHIFT(R3,R13)
; R5 = DIV(R4,R5)
; LC_PUSH_EQ('0',R5)
; R10 = PASS('-1')
; LC_PUSH_LT('0',R4)
; R5 = PASS('1')
; [INV] R5 = PASS('-1')
; [POP]
; [INV] R10 = PASS('0')
; [POP] R6 = SUB(R3,R5)
; R7 = MULT(R6,R6)
; R7 = SUB(R7,R2)
; LC_PUSH_EQ('-1',R10)
; LC_PUSH_GT('0',R4)
; R4 = SUB('0',R4)
; [POP] LC_PUSH_GT('0',R7)
; R7 = SUB('0',R7)
; [POP]
; LC_PUSH_LT(R7,R4)
; R3 = PASS(R6)
; [POP] R9 = PASS('0')
; R4 = PASS('0')
; [INV] R3 = PASS(R6)
; R4 = PASS(R7)
; R9 = PASS('-1')
; [POP]
; [POP]
LTST RSPO SQRT ; RESPOND(R9)


Figure  D.7:  Assembly  Language  Program  for  sobel 
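
The SQRT loop in the listing reads as a Newton-style refinement: the estimate starts at half the radicand, the signed error is the estimate squared minus the radicand, and the correction is that error divided by twice the estimate. The Python sketch below illustrates this arithmetic only. It is not a transcription of the listing: the name `refine_sqrt`, the `max(1, ...)` guard against a zero divisor, the `max_iters` default, and the simplified termination test are all assumptions made for illustration.

```python
def refine_sqrt(v: int, max_iters: int = 64) -> int:
    """Integer square root by Newton refinement.

    When the loop converges within max_iters, the result is either the
    floor or the ceiling of sqrt(v).
    """
    if v < 0:
        raise ValueError("v must be non-negative")
    if v == 0:
        return 0
    x = max(1, v >> 1)              # initial estimate: half the radicand (guard is an assumption)
    for _ in range(max_iters):
        err = x * x - v             # signed error of the current estimate
        if err == 0:
            return x                # exact root found
        step = abs(err) // (2 * x)  # magnitude of the Newton correction err / (2x)
        if step == 0:
            # The correction has underflowed, so sqrt(v) lies between x and
            # the neighbouring integer on the side indicated by the error sign.
            neighbour = x - 1 if err > 0 else x + 1
            if abs(err) <= abs(neighbour * neighbour - v):
                return x
            return neighbour
        x = x - step if err > 0 else x + step
    return x
```

Because every PE runs the same instruction stream, the listing keeps iterating until no PE responds; the sketch models a single PE, so it simply returns when its own correction vanishes.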


program median(@buf,P) :

! @buf is address in local external memory of 9-element buffer that
! is used to hold each of the latest 3 sorted local rows.

! P is the number of PEs and the number of pixels per scanline and
! the number of scanlines.

! IR1 holds the input line number, IR2 holds the output line number.


LDX IR1 '-2' ;
SPX IR1 '2' ; R10 = IO_LD('0')
LDX IR2 '-1' ;
LDX IR0 '0' ; R14 = LITERAL('512')
LDX IR0 '0' ; R15 = LITERAL('P - 1')
LDX IR0 '0' ; R5 = LITERAL('@buf')
LDX IR0 '0' ; R3 = LITERAL('@buf + 3')
LDX IR0 '0' ; R1 = LITERAL('@buf + 6')
; R6 = PASS(R5)
; R4 = PASS(R3)
; R2 = PASS(R1)

! Grab the first line, sort the local 3 pixels
; R7 = PASS(R10)
SPX IR1 '2' ; R10 = IO_LD('0')
; R8 = LDP0(R7)
; LC_PUSH_EQ('0',R0)
; R8 = PASS(R7)
; [POP] R9 = LDN0(R7)
; LC_PUSH_EQ(R15,R0)
; R9 = PASS(R7)
CJSR FORC SORT_3 '0' ; [POP]

! Store the locally sorted first row twice
; STORE(R7,R6)
; R6 = ADD('1',R6)
; STORE(R8,R6)
; R6 = ADD('1',R6)
; STORE(R9,R6)
; R6 = PASS(R5)
; STORE(R7,R4)
; R4 = ADD('1',R4)
; STORE(R8,R4)
; R4 = ADD('1',R4)
; STORE(R9,R4)
; R4 = PASS(R3)

! Grab the second row, sort the local 3 pixels
; R7 = PASS(R10)
SPX IR1 '2' ; R10 = IO_LD('0')
; R8 = LDP0(R7)
; LC_PUSH_EQ('0',R0)
; R8 = PASS(R7)
; [POP] R9 = LDN0(R7)
; LC_PUSH_EQ(R15,R0)
; R9 = PASS(R7)
CJSR FORC SORT_3 '0' ; [POP]


! Store the locally sorted row to memory, set up local median calc:
; STORE(R7,R2)
; R2 = ADD('1',R2)
; STORE(R8,R2)
; R2 = ADD('1',R2)
; STORE(R9,R2)
; R2 = ADD('-1',R2)
; R8 = LOAD(R4)
; R4 = ADD('1',R4)
; R9 = LOAD(R6)

! Skip over the four least of the 9 pixels in the 3 sorted buffers:
CJSR FORC SKIP_LEAST '3' ; R6 = ADD('1',R6)

! Pick the least remaining element, put it into R11
CJSR FORC PICK_LEAST '0' ;

! rotate the buffer pointer base addresses, re-initialize buffer pointers
; R12 = PASS(R5)
; R5 = PASS(R3)
; R3 = PASS(R1)
; R1 = PASS(R12)
; R6 = PASS(R5)
; R4 = PASS(R3)

! Iterate the steady-state loop P-3 times
CJSR FORC MF_LOOP 'P-4' ; R2 = PASS(R1)
; R7 = PASS(R10)
SPX IR2 '2' ; IO_ST(R11,'0')
; R8 = LDP0(R7)
; LC_PUSH_EQ('0',R0)
; R8 = PASS(R7)
; [POP] R9 = LDN0(R7)
; LC_PUSH_EQ(R15,R0)
; R9 = PASS(R7)
CJSR FORC SORT_3 '0' ; [POP]

! Store the locally sorted row to memory, set up local median calc:
; STORE(R7,R2)
; R2 = ADD('1',R2)
; STORE(R8,R2)
; R2 = ADD('1',R2)
; STORE(R9,R2)
; R2 = ADD('-1',R2)
; R8 = LOAD(R4)
; R4 = ADD('1',R4)
; R9 = LOAD(R6)

! Skip over the four least of the 9 pixels in the 3 sorted buffers:
CJSR FORC SKIP_LEAST '3' ; R6 = ADD('1',R6)

! Pick the least remaining element, put it into R11
CJSR FORC PICK_LEAST '0' ;


! rotate the buffer pointer base addresses, re-initialize buffer pointers
SPX IR2 '2' ; IO_ST(R11,'0')
; R12 = PASS(R5)
; R5 = PASS(R3)
; R3 = PASS(R1)
; R1 = PASS(R12)
; R6 = PASS(R5)

! Copy 1back into curr and set up median calculation for final row
; R4 = ADD('2',R3)
; R9 = LOAD(R4)
; R4 = ADD('-1',R4)
; R2 = ADD('2',R1)
; STORE(R9,R2)
; R2 = ADD('-1',R2) ! curr points to second element of row
; R8 = LOAD(R4)
; R4 = ADD('-1',R4)
; STORE(R8,R2)
; R7 = LOAD(R4)
; R4 = ADD('1',R4) ! 1back points to second element of row
; R8 = PASS(R7)
; R9 = LOAD(R6)
CJSR FORC SKIP_LEAST '3' ; R6 = ADD('1',R6) ! 2back points to 2nd elt

! Pick the least remaining element, put it into R11
CJSR FORC PICK_LEAST '0' ;
SPX IR2 '2' ; IO_ST(R11,'0')
HALT ;

!

SORT_3:
; LC_PUSH_LT(R8,R7)
; R12 = PASS(R7)
; R7 = PASS(R8)
; R8 = PASS(R12)
; [POP] LC_PUSH_LT(R9,R7)
; R12 = PASS(R7)
; R7 = PASS(R9)
; R9 = PASS(R12)
; [POP] LC_PUSH_LT(R9,R8)
; R12 = PASS(R8)
; R8 = PASS(R9)
; R9 = PASS(R12)
LTST ICTO SORT_3 ; [POP]


SKIP_LEAST:
; LC_PUSH_LT(R7,R8)
; LC_PUSH_LT(R7,R9)
; R13 = ADD('3',R1)
; LC_PUSH_EQ(R2,R13)
; R7 = PASS(R14) ! make pix0 bigger than any pixel
; [INV] R7 = LOAD(R2)
; R2 = ADD('1',R2)
; [POP]
; [INV] R13 = ADD('3',R5)
; LC_PUSH_EQ(R6,R13)
; R9 = PASS(R14) ! make pix2 bigger than any pixel
; [INV] R9 = LOAD(R6)
; R6 = ADD('1',R6)
; [POP]
; [POP]
; [INV] LC_PUSH_LT(R8,R9)
; R13 = ADD('3',R3)
; LC_PUSH_EQ(R4,R13)
; R8 = PASS(R14) ! make pix1 bigger than any pixel
; [INV] R8 = LOAD(R4)
; R4 = ADD('1',R4)
; [POP]
; [INV] R13 = ADD('3',R5)
; LC_PUSH_EQ(R6,R13)
; R9 = PASS(R14) ! make pix2 bigger than any pixel
; [INV] R9 = LOAD(R6)
; R6 = ADD('1',R6)
; [POP]
; [POP]
LTST ICTO SKIP_LEAST ; [POP]

PICK_LEAST:
; LC_PUSH_LT(R7,R8)
; LC_PUSH_LT(R7,R9)
; R11 = PASS(R7)
; [INV] R11 = PASS(R9)
; [POP]
; [INV] LC_PUSH_LT(R8,R9)
; R11 = PASS(R8)
; [INV] R11 = PASS(R9)
; [POP]
LTST ICTO PICK_LEAST ; [POP]

MF_LOOP:

! Internalize next row of pixels
; R7 = PASS(R10)
SPX IR2 '2' ; IO_ST(R11,'0') ! on SLAP, these two I/O
SPX IR1 '2' ; R10 = IO_LD('0') ! operations overlap fully.
; R8 = LDP0(R7)
; LC_PUSH_EQ('0',R0)
; R8 = PASS(R7)
; [POP] R9 = LDN0(R7)
; LC_PUSH_EQ(R15,R0)
; R9 = PASS(R7)

! Sort the 3 pixels in the neighborhood
CJSR FORC SORT_3 '0' ; [POP]

! Store the sorted 3 elements to memory, set up local median calculation
; STORE(R7,R2)
; R2 = ADD('1',R2)
; STORE(R8,R2)
; R2 = ADD('1',R2)
; STORE(R9,R2)
; R2 = ADD('-1',R2)
; R8 = LOAD(R4)
; R4 = ADD('1',R4)
; R9 = LOAD(R6)

! Skip over the four least of the 9 pixels in the 3 sorted buffers:
CJSR FORC SKIP_LEAST '3' ; R6 = ADD('1',R6)

! Pick the least remaining element, put it into R11
CJSR FORC PICK_LEAST '0' ;

! rotate the buffer pointers
; R12 = PASS(R5)
; R5 = PASS(R3)
; R3 = PASS(R1)
; R1 = PASS(R12)
; R6 = PASS(R5)
; R4 = PASS(R3)
LTST ICTO MF_LOOP ; R2 = PASS(R1)


Figure D.8: Assembly Language Program for median
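
The comments in the listing describe the per-pixel median as a three-way selection: each of the three rows contributes a sorted triple, the four least of the nine values are skipped, and the least remaining value is the median. A minimal Python sketch of that idea for one 3x3 neighborhood follows. The function names and the list-based buffers are assumptions for illustration; `SENTINEL = 512` mirrors the literal the listing loads into R14 and assumes pixel values below 512.

```python
from typing import List, Sequence

SENTINEL = 512  # assumption: larger than any pixel value, as with the listing's R14


def sort3(a: int, b: int, c: int) -> List[int]:
    """Sort three values with three compare-and-swap steps, in SORT_3 order."""
    if b < a:
        a, b = b, a
    if c < a:
        a, c = c, a
    if c < b:
        b, c = c, b
    return [a, b, c]


def median_of_nine(rows: Sequence[Sequence[int]]) -> int:
    """Median of a 3x3 neighborhood given as three rows of three pixels."""
    buffers = [sort3(*row) for row in rows]   # one sorted triple per row
    heads = [buf[0] for buf in buffers]       # current least value of each buffer
    nxt = [1, 1, 1]                           # index of the next unread element
    for _ in range(4):                        # skip the four least of the nine
        k = heads.index(min(heads))
        if nxt[k] < 3:
            heads[k] = buffers[k][nxt[k]]
            nxt[k] += 1
        else:
            heads[k] = SENTINEL               # buffer exhausted: exclude it
    return min(heads)                         # fifth smallest of nine = median
```

For example, `median_of_nine([[3, 1, 2], [9, 7, 8], [5, 4, 6]])` returns 5. Keeping each row sorted once and reusing it for three consecutive output rows is what lets the listing store only the latest three sorted rows per PE.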
Appendix E

Measured F0 and F2 Speedup Bounds


The two simplest single-port I-caches, F0 and F2, were designed. Each of these I-caches is capable of
storing only a single cache block at a time. The difference between them is that F2 contains an
iteration counter and is able to sequence through multiple iterations of a cache block from a single
activation, whereas F0 executes only single iterations of cache blocks.

These I-cache variants were evaluated for the sample programs through detailed simulation on
each of four SIMD computer variants, SIMD-A, B, C, and D. These computers differ, for example, in
PE datapath widths and in numbers of PEs per PE chip. In the simulations, operation stepcount
parameters were used to express the number of clock cycles required to perform each operation
on problem data. The following table summarizes the characteristics of the four SIMD computer
variants, including typical stepcounts for 32-bit problem data:


                       SIMD-A  SIMD-B  SIMD-C  SIMD-D
PEs per chip              128      32       4       2
FU bit-width                1       4      16      32
NOR stepcount              32       8       2       1
ADD stepcount              32       8       2       1
MULT stepcount           1056     263      34       1
LC_PUSH_EQ stepcount       32       8       2       1
LOAD stepcount            128      32       4       2
LDN0 stepcount             32       8       2       1
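
One regularity in the table can be checked directly: for the width-limited operations (NOR, ADD, LC_PUSH_EQ, LDN0), the stepcount equals the 32-bit word length divided by the bit-width row. The short Python sketch below verifies this for ADD. The dictionary names and the helper `width_limited_stepcount` are illustrative assumptions; MULT and LOAD do not follow this rule and are left out.

```python
WORD_BITS = 32  # problem data width stated for the table

# Values copied from the bit-width and ADD stepcount rows of the table.
DATAPATH_WIDTH = {"SIMD-A": 1, "SIMD-B": 4, "SIMD-C": 16, "SIMD-D": 32}
ADD_STEPCOUNT = {"SIMD-A": 32, "SIMD-B": 8, "SIMD-C": 2, "SIMD-D": 1}


def width_limited_stepcount(width_bits: int, word_bits: int = WORD_BITS) -> int:
    """Clock cycles to pass a word through a datapath width_bits wide."""
    return -(-word_bits // width_bits)  # ceiling division


for name, width in DATAPATH_WIDTH.items():
    print(name, width_limited_stepcount(width), ADD_STEPCOUNT[name])
```

Each printed line shows the computed and tabulated stepcounts side by side for one computer variant.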

Maximum subsystem clock rates depend on wire geometries and electrical propagation
characteristics of the implementation technology. In the simulations, p-set parameters were used to express
relative multi-chip subsystem (MCS) clock rates. The PE clock rate is p_X times higher than the clock
rate for MCS X. The highest I-cache speedups are obtained where all MCSs other than global
instruction broadcast operate at the highest possible clock rates. The limitation arising from relatively
slow instruction broadcast is most severe in this case. The lowest I-cache speedups are obtained where
all MCSs operate at the low rate of global instruction broadcast. Subsystem-boundedness is most
severe in this case, and I-cache is least advantageous because the greatest proportion of computation
time is spent waiting for MCS operations to complete, as opposed to waiting for globally broadcast
instructions to arrive at the PE chips.

I-cache speedup bounds were obtained by simulating I-cached computations at the p-sets
characterizing the limiting extremes of relative MCS clock rates. This appendix presents the complete set
of measured I-cache speedups, for each sample problem on each SIMD computer variant. In each
case, the speedup bounds are plotted against values of p_b ranging from 1 to 16.
Superimposed on each set of measured speedup bounds is a curve of the form

$$\frac{(1 + C)\,p_b}{p_b + C}$$

This speedup formula, derived in Section 5.10, is an approximation of the speedup of program simple,
consisting of one loop. C represents the product of the fraction of instructions in simple that lie
within the loop and the number of loop iterations. The closeness of fit of this approximation to the
I-cache speedups for the sample programs is remarkable, in view of their varying and sometimes
complex loop structures. The one significant departure from a good fit occurs for F2 for the sample
problem rowcol, due to poor cache management used in that case. For F2 speedup bounds on
rowcol, the measured values are fit to a curve of the form

$$\frac{(1 + C)\,p_b}{p_b + C} - \frac{g\,p_b}{p_b + C}$$

where the parameter g expresses the fraction of instructions in simple that lie within the loop. The
second term in this formula represents an I-cache penalty.
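
The two curve forms above differ only by the penalty term, so they can be evaluated with one function. The Python sketch below does so; the function name, the argument names, and the sample values of C and g used in the usage note are assumptions for illustration, not values taken from the measurements.

```python
def simple_equivalent_speedup(p_b: float, c: float, g: float = 0.0) -> float:
    """Evaluate (1 + C) p_b / (p_b + C) - g p_b / (p_b + C).

    p_b : ratio of PE clock rate to global instruction broadcast rate
    c   : fraction of instructions inside the loop times the iteration count
    g   : fraction of instructions inside the loop (0 drops the penalty term)
    """
    if p_b <= 0 or c < 0:
        raise ValueError("p_b must be positive and c non-negative")
    return (1.0 + c) * p_b / (p_b + c) - g * p_b / (p_b + c)


# Tabulate the curve over the plotted range of p_b for an illustrative C.
for p_b in (1, 2, 4, 8, 16):
    print(p_b, round(simple_equivalent_speedup(p_b, c=10.0), 3))
```

Without the penalty term the curve equals 1 at p_b = 1 and approaches 1 + C as p_b grows. With a nonzero g the value at p_b = 1 falls below 1, which corresponds to a slowdown when the PE clock is no faster than instruction broadcast.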

The  graphs  on  the  following  pages  are  plotted  to  scales  indicated  at  the  bottom  of  each  left-hand 
(even-numbered)  page. 
Figure E.1: F0- and F2-Speedup Bounds for the program median on SIMD-A, B, C, & D.

The I-cache speedups are low because of the program's complex loop structure. To
get good speedup for this program, an I-cache variant needs to be able to store
multiple cache blocks at once. F0 and F2 are capable of storing only a single cache
block at a time, so the cache blocks tend to thrash these simple I-cache variants.




Figure E.2: F0- and F2-Speedup Bounds for the program sobel on SIMD-A, B, C, & D.

The loop body of this program is a square-root approximation-refinement step
whose iteration is globally data-dependent. The small differences between upper
and lower speedup bounds confirm that the repeated instruction sequence is
extremely calculation-intensive. The modest I-cache speedups for this program are
due primarily to a small total iteration count of the loop body.


Figure E.3: F0- and F2-Speedup Bounds for the program tree on SIMD-A, B, C, & D.

This program consists of a short prolog followed by a single iterated loop. The
speedup upper bounds are lower for F0 than for F2 because of quantization.
Figure E.5: F0- and F2-Speedup Bounds for the program rowcol on SIMD-A, B, C, & D.

The row and column sorts that are the inner loops of this program execute for
data-dependent numbers of iterations. The low F2 speedups show that the cache
management used, effectively unrolling the loops by 4, is disadvantageous. This
result illustrates the hazards of poor cache management. The F2 results are so poor
that the error term (g) must be included in the simple-equivalent speedup function in
order to achieve a good fit.
Figure E.6: F0- and F2-Speedup Bounds for the program bitonic on SIMD-A, B, C, & D.

The number of iterations of the inner loop varies on each iteration of the outer loop
as a function of the outer loop index and the input data set size. The basis computer's
system controller lacks the ability to activate F2 cache blocks for varying numbers of
iterations. The resulting management of F2 I-cache is to iterate cache blocks singly.
Therefore, results obtained with F2 are identical to those obtained with F0.
Figure E.7: F0- and F2-Speedup Bounds for the program bubble on SIMD-A, B, C, & D.

The difference between the speedup upper bounds is less on SIMD-A than on
SIMD-D. The loop body on SIMD-A contains a large number of instructions, as are
needed to control the simple FU. Each iteration of the loop body is made nearly
times faster with F0 on SIMD-A. On SIMD-D, whose PEs have a more powerful FU,
the speedup is not so great for single iterations, and quantization causes the F0
speedup upper bound to be much lower than the F2 upper bound.
Figure E.8: F0- and F2-Speedup Bounds for the program matmul on SIMD-A, B, C, & D.

The poor fits of the simple-equivalent speedup curve to the F0 upper bound on
SIMD-C and on SIMD-D are due to quantization. However, the fit to the F2 upper
bound on SIMD-D is also somewhat poor, arising from a slowdown at p_b = 1. This
poor fit suggests that it is not so appropriate to discard the simple-equivalent speedup
formula error term in this case as it is for most of the other cases.
Appendix F

Summary of F0 Speedup Bounds


To facilitate comparison of the range of I-cache bounds, a complete set of “simple-equivalent” F0
speedup lower bounds measured on each SIMD computer variant is plotted on one graph, and a
complete set of “simple-equivalent” F0 speedup upper bounds measured on each SIMD computer
variant is also plotted on one graph.




Figure F^. A summary of F0 Lower Bounds on the subject programs in SIMD-A, B, C, & D.
Curve spreads decrease as the hardware variant becomes more powerful.
Appendix G

Summary of F2 Speedup Bounds


To facilitate comparison of the range of I-cache bounds, a complete set of “simple-equivalent” F2
speedup lower bounds measured on each SIMD computer variant is plotted on one graph, and a
complete set of “simple-equivalent” F2 speedup upper bounds measured on each SIMD computer
variant is also plotted on one graph.
Figure G.2: A summary of F2 Lower Bounds on the subject programs in SIMD-A, B, C, & D.
SIMD-C and D I-cache speedups are very similar.